Abstract: Recent progress in artificial intelligence has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats that of humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn and how they learn it. Specifically, we argue that these machines should (1) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (2) ground learning in intuitive theories of physics and psychology to support and enrich the knowledge that is learned; and (3) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes toward these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
1. Introduction

Artificial intelligence (AI) has been a story of booms and busts, yet by any traditional measure of success, the last few years have been marked by exceptional progress. Much of this progress has come from recent advances in “deep learning,” characterized by learning large neural network-style models with multiple layers of representation (see Glossary in Table 1). These models have achieved remarkable gains in many domains spanning object recognition, speech recognition, and control (LeCun et al. 2015; Schmidhuber 2015). In object recognition, Krizhevsky et al. (2012) trained a deep convolutional neural network (ConvNet [LeCun et al. 1989]) that nearly halved the previous state-of-the-art error rate on the most challenging benchmark to date. In the years since, ConvNets continue to dominate, recently approaching human-level performance on some object recognition benchmarks (He et al. 2016; Russakovsky et al. 2015; Szegedy et al. 2014). In automatic speech recognition, hidden Markov models (HMMs) have been the leading approach since the late 1980s (Juang & Rabiner 1990), yet this framework has been chipped away piece by piece and replaced with deep learning components (Hinton et al. 2012). Now, the leading approaches to speech recognition are fully neural network systems (Graves et al. 2013; Hannun et al. 2014). Ideas from deep learning have also been applied to learning complex control problems. Mnih et al. (2015) combined ideas from deep learning and reinforcement learning to make a “deep reinforcement learning” algorithm that learns to play large classes of simple video games from just frames of pixels and the game score, achieving human- or superhuman-level performance on many of them (see also Guo et al. 2014; Schaul et al. 2016; Stadie et al. 2016).
These accomplishments have helped neural networks regain their status as a leading paradigm in machine learning, much as they were in the late 1980s and early 1990s. The recent success of neural networks has captured attention beyond academia. In industry, companies such as Google and Facebook have active research divisions exploring these technologies, and object and speech recognition systems based on deep learning have been deployed in core products on smart phones and the web. The media have also covered many of the recent achievements of neural networks, often expressing the view that neural networks have achieved this recent success by virtue of their brain-like computation and, therefore, their ability to emulate human learning and human cognition.
In this article, we view this excitement as an opportunity to examine what it means for a machine to learn or think like a person. We first review some of the criteria previously offered by cognitive scientists, developmental psychologists, and artificial intelligence (AI) researchers. Second, we articulate what we view as the essential ingredients for building a machine that learns or thinks like a person, synthesizing theoretical ideas and experimental data from research in cognitive science. Third, we consider contemporary AI (and deep learning in particular) in the light of these ingredients, finding that deep learning models have yet to incorporate many of them, and so may be solving some problems in different ways than people do. We end by discussing what we view as the most plausible paths toward building machines that learn and think like people. This includes prospects for integrating deep learning with the core cognitive ingredients we identify, inspired in part by recent work fusing neural networks with lower-level building blocks from classic psychology and computer science (attention, working memory, stacks, queues) that have traditionally been seen as incompatible.
Beyond the specific ingredients in our proposal, we draw a broader distinction between two different computational approaches to intelligence. The statistical pattern recognition approach treats prediction as primary, usually in the context of a specific classification, regression, or control task. In this view, learning is about discovering features that have high-value states in common – a shared label in a classification setting or a shared value in a reinforcement learning setting – across a large, diverse set of training data. The alternative approach treats models of the world as primary, where learning is the process of model building. Cognition is about using these models to understand the world, to explain what we see, to imagine what could have happened that didn’t, or what could be true that isn’t, and then planning actions to make it so. The difference between pattern recognition and model building, between prediction and explanation, is central to our view of human intelligence. Just as scientists seek to explain nature, not simply predict it, we see human thought as fundamentally a model building activity. We elaborate this key point with numerous examples below. We also discuss how pattern recognition, even if it is not the core of intelligence, can nonetheless support model building, through “model-free” algorithms that learn through experience how to make essential inferences more computationally efficient.

Before proceeding, we provide a few caveats about the goals of this article, and a brief overview of the key ideas.
1.1. What this article is not

For nearly as long as there have been neural networks, there have been critiques of neural networks (Crick 1989; Fodor & Pylyshyn 1988; Marcus 1998, 2001; Minsky & Papert 1969; Pinker & Prince 1988). Although we are critical of neural networks in this article, our goal is to build on their successes rather than dwell on their shortcomings. We see a role for neural networks in developing more human-like learning machines: They have been applied in compelling ways to many types of machine learning problems, demonstrating the power of gradient-based learning and deep hierarchies of latent variables. Neural networks also have a rich history as computational models of cognition (McClelland et al. 1986; Rumelhart et al. 1986b). It is a history we describe in more detail in the next section. At a more fundamental level, any computational model of learning must ultimately be grounded in the brain’s biological neural networks.
We also believe that future generations of neural networks will look very different from the current state-of-the-art neural networks. They may be endowed with intuitive physics, theory of mind, causal reasoning, and other capacities we describe in the sections that follow. More structure and inductive biases could be built into the networks or learned from previous experience with related tasks, leading to more human-like patterns of learning and development. Networks may learn to effectively search for and discover new mental models or intuitive theories, and these improved models will, in turn, enable subsequent learning, allowing systems that learn-to-learn – using previous knowledge to make richer inferences from very small amounts of training data.
It is also important to draw a distinction between AI that purports to emulate or draw inspiration from aspects of human cognition and AI that does not. This article focuses on the former. The latter is a perfectly reasonable and useful approach to developing AI algorithms: avoiding cognitive or neural inspiration as well as claims of cognitive or neural plausibility. Indeed, this is how many researchers have proceeded, and this article has little pertinence to work conducted under this research strategy.1 On the other hand, we believe that reverse engineering human intelligence can usefully inform AI and machine learning (and has already done so), especially for the types of domains and tasks that people excel at. Despite recent computational achievements, people are better than machines at solving a range of difficult computational problems, including concept learning, scene understanding, language acquisition, language understanding, speech recognition, and so on. Other human cognitive abilities remain difficult to understand computationally, including creativity, common sense, and general-purpose reasoning. As long as natural intelligence remains the best example of intelligence, we believe that the project of reverse engineering the human solutions to difficult computational problems will continue to inform and advance AI.
Finally, whereas we focus on neural network approaches to AI, we do not wish to give the impression that these are the only contributors to recent advances in AI. On the contrary, some of the most exciting recent progress has been in new forms of probabilistic machine learning (Ghahramani 2015). For example, researchers have developed automated statistical reasoning techniques (Lloyd et al. 2014), automated techniques for model building and selection (Grosse et al. 2012), and probabilistic programming languages (e.g., Gelman et al. 2015; Goodman et al. 2008; Mansinghka et al. 2014). We believe that these approaches will play important roles in future AI systems, and they are at least as compatible with the ideas from cognitive science we discuss here. However, a full discussion of those connections is beyond the scope of the current article.
1.2. Overview of the key ideas

The central goal of this article is to propose a set of core ingredients for building more human-like learning and thinking machines. We elaborate on each of these ingredients and topics in Section 4, but here we briefly overview the key ideas.
The first set of ingredients focuses on developmental “start-up software,” or cognitive capabilities present early in development. There are several reasons for this focus on development. If an ingredient is present early in development, it is certainly active and available well before a child or adult would attempt to learn the types of tasks discussed in this paper. This is true regardless of whether the early-present ingredient is itself learned from experience or innately present. Also, the earlier an ingredient is present, the more likely it is to be foundational to later development and learning.
We focus on two pieces of developmental start-up software (see Wellman & Gelman [1992] for a review of both). First is intuitive physics (sect. 4.1.1): Infants have primitive object concepts that allow them to track objects over time and to discount physically implausible trajectories. For example, infants know that objects will persist over time and that they are solid and coherent. Equipped with these general principles, people can learn more quickly and make more accurate predictions. Although a task may be new, physics still works the same way. A second type of software present in early development is intuitive psychology (sect. 4.1.2): Infants understand that other people have mental states like goals and beliefs, and this understanding strongly constrains their learning and predictions. A child watching an expert play a new video game can infer that the avatar has agency and is trying to seek reward while avoiding punishment. This inference immediately constrains other inferences, allowing the child to infer what objects are good and what objects are bad. These types of inferences further accelerate the learning of new tasks.
Our second set of ingredients focuses on learning. Although there are many perspectives on learning, we see model building as the hallmark of human-level learning, or explaining observed data through the construction of causal models of the world (sect. 4.2.2). From this perspective, the early-present capacities for intuitive physics and psychology are also causal models of the world. A primary job of learning is to extend and enrich these models and to build analogous causally structured theories of other domains.
Compared with state-of-the-art algorithms in machine learning, human learning is distinguished by its richness and its efficiency. Children come with the ability and the desire to uncover the underlying causes of sparsely observed events and to use that knowledge to go far beyond the paucity of the data. It might seem paradoxical that people are capable of learning these richly structured models from very limited amounts of experience. We suggest that compositionality and learning-to-learn are ingredients that make this type of rapid model learning possible (sects. 4.2.1 and 4.2.3, respectively).
A final set of ingredients concerns how the rich models our minds build are put into action, in real time (sect. 4.3). It is remarkable how fast we are to perceive and to act. People can comprehend a novel scene in a fraction of a second, or a novel utterance in little more than the time it takes to say it and hear it. An important motivation for using neural networks in machine vision and speech systems is to respond as quickly as the brain does. Although neural networks are usually aiming at pattern recognition rather than model building, we discuss ways in which these “model-free” methods can accelerate slow model-based inferences in perception and cognition (sect. 4.3.1) (see Glossary in Table 1). By learning to recognize patterns in these inferences, the outputs of inference can be predicted without having to go through costly intermediate steps. Integrating neural networks that “learn to do inference” with rich model building learning mechanisms offers a promising way to explain how human minds can understand the world so well and so quickly.
We also discuss the integration of model-based and model-free methods in reinforcement learning (sect. 4.3.2), an area that has seen rapid recent progress. Once a causal model of a task has been learned, humans can use the model to plan action sequences that maximize future reward. When rewards are used as the metric for success in model building, this is known as model-based reinforcement learning. However, planning in complex models is cumbersome and slow, making the speed-accuracy trade-off unfavorable for real-time control. By contrast, model-free reinforcement learning algorithms, such as current instantiations of deep reinforcement learning, support fast control, but at the cost of inflexibility and possibly accuracy. We review evidence that humans combine model-based and model-free learning algorithms both competitively and cooperatively and that these interactions are supervised by metacognitive processes. The sophistication of human-like reinforcement learning has yet to be realized in AI systems, but this is an area where crosstalk between cognitive and engineering approaches is especially promising.
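To make this trade-off concrete, the sketch below contrasts a model-free Q-learning update with model-based planning by lookahead in a learned model. It is our own illustrative toy, not a description of any system cited here; the tabular setting and all parameter values are assumptions made for exposition.

```python
import numpy as np

# Model-free control: tabular Q-learning. Values are cached per
# state-action pair, so acting is fast, but there is no world model
# to consult when the task or the goal changes.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# Model-based control: plan by finite-depth lookahead in a learned model.
# T[s, a] gives the next state and R[s, a] the reward. Planning is slower
# per decision, but adapts immediately if R (the goal) or T (the world)
# changes, with no retraining of cached values.
def plan(T, R, s, depth=3, gamma=0.99):
    if depth == 0:
        return 0.0, None
    best_value, best_action = -np.inf, None
    for a in range(R.shape[1]):
        future_value, _ = plan(T, R, T[s, a], depth - 1, gamma)
        value = R[s, a] + gamma * future_value
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```

The asymmetry in this cartoon is the one discussed above: the model-free learner is fast at decision time but slow to adapt, whereas the planner pays a per-decision cost in exchange for flexibility.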
2. Cognitive and neural inspiration in artificial intelligence

The questions of whether and how AI should relate to human cognitive psychology are older than the terms artificial intelligence and cognitive psychology. Alan Turing suspected that it was easier to build and educate a child-machine than try to fully capture adult human cognition (Turing 1950). Turing pictured the child’s mind as a notebook with “rather little mechanism and lots of blank sheets,” and the mind of a child-machine as filling in the notebook by responding to rewards and punishments, similar to reinforcement learning. This view on representation and learning echoes behaviorism, a dominant psychological tradition in Turing’s time. It also echoes the strong empiricism of modern connectionist models – the idea that we can learn almost everything we know from the statistical patterns of sensory inputs.
Cognitive science repudiated the oversimplified behaviorist view and came to play a central role in early AI research (Boden 2006). Newell and Simon (1961) developed their “General Problem Solver” as both an AI algorithm and a model of human problem solving, which they subsequently tested experimentally (Newell & Simon 1972). AI pioneers in other areas of research explicitly referenced human cognition and even published papers in cognitive psychology journals (e.g., Bobrow & Winograd 1977; Hayes-Roth & Hayes-Roth 1979; Winograd 1972). For example, Schank (1972), writing in the journal Cognitive Psychology, declared that “We hope to be able to build a program that can learn, as a child does, how to do what we have described in this paper instead of being spoon-fed the tremendous information necessary” (p. 629).
A similar sentiment was expressed by Minsky (1974): “I draw no boundary between a theory of human thinking and a scheme for making an intelligent machine; no purpose would be served by separating these today since neither domain has theories good enough to explain—or to produce—enough mental capacity” (p. 6).
Much of this research assumed that human knowledge representation is symbolic and that reasoning, language, planning, and vision could be understood in terms of symbolic operations. Parallel to these developments, a radically different approach was being explored based on neuron-like “sub-symbolic” computations (e.g., Fukushima 1980; Grossberg 1976; Rosenblatt 1958). The representations and algorithms used by this approach were more directly inspired by neuroscience than by cognitive psychology, although ultimately it would flower into an influential school of thought about the nature of cognition: parallel distributed processing (PDP) (McClelland et al. 1986; Rumelhart et al. 1986b). As its name suggests, PDP emphasizes parallel computation by combining simple units to collectively implement sophisticated computations. The knowledge learned by these neural networks is thus distributed across the collection of units rather than localized as in most symbolic data structures. The resurgence of recent interest in neural networks, more commonly referred to as “deep learning,” shares the same representational commitments and often even the same learning algorithms as the earlier PDP models. “Deep” refers to the fact that more powerful models can be built by composing many layers of representation (see LeCun et al. [2015] and Schmidhuber [2015] for recent reviews), still very much in the PDP style while utilizing recent advances in hardware and computing capabilities, as well as massive data sets, to learn deeper models.
It is also important to clarify that the PDP perspective is compatible with “model building” in addition to “pattern recognition.” Some of the original work done under the banner of PDP (Rumelhart et al. 1986b) is closer to model building than pattern recognition, whereas the recent large-scale discriminative deep learning systems more purely exemplify pattern recognition (see Bottou [2014] for a related discussion). But, as discussed, there is also a question of the nature of the learned representations within the model – their form, compositionality, and transferability – and the developmental start-up software that was used to get there. We focus on these issues in this article.
Neural network models and the PDP approach offer a view of the mind (and intelligence more broadly) that is sub-symbolic and often populated with minimal constraints and inductive biases to guide learning. Proponents of this approach maintain that many classic types of structured knowledge, such as graphs, grammars, rules, objects, structural descriptions, and programs, can be useful yet misleading metaphors for characterizing thought. On this view, these structures are more epiphenomenal than real: emergent properties of more fundamental sub-symbolic cognitive processes (McClelland et al. 2010). Compared with other paradigms for studying cognition, this position on the nature of representation is often accompanied by a relatively “blank slate” vision of initial knowledge and representation, much like Turing’s blank notebook.
When attempting to understand a particular cognitive ability or phenomenon within this paradigm, a common scientific strategy is to train a relatively generic neural network to perform the task, adding additional ingredients only when necessary. This approach has shown that neural networks can behave as if they learned explicitly structured knowledge, such as a rule for producing the past tense of words (Rumelhart & McClelland 1986), rules for solving simple balance beam physics problems (McClelland 1988), or a tree to represent types of living things (plants and animals) and their distribution of properties (Rogers & McClelland 2004). Training large-scale relatively generic networks is also the best current approach for object recognition (He et al. 2016; Krizhevsky et al. 2012; Russakovsky et al. 2015; Szegedy et al. 2014), where the high-level feature representations of these convolutional nets have also been used to predict patterns of neural response in human and macaque IT cortex (Khaligh-Razavi & Kriegeskorte 2014; Kriegeskorte 2015; Yamins et al. 2014), as well as human typicality ratings (Lake et al. 2015b) and similarity ratings (Peterson et al. 2016) for images of common objects. Moreover, researchers have trained generic networks to perform structured and even strategic tasks, such as the recent work on using a Deep Q-learning Network (DQN) to play simple video games (Mnih et al. 2015) (see Glossary in Table 1). If neural networks have such broad application in machine vision, language, and control, and if they can be trained to emulate the rule-like and structured behaviors that characterize cognition, do we need more to develop truly human-like learning and thinking machines? How far can relatively generic neural networks bring us toward this goal?
3. Challenges for building more human-like machines

Although cognitive science has not yet converged on a single account of the mind or intelligence, the claim that a mind is a collection of general-purpose neural networks with few initial constraints is rather extreme in contemporary cognitive science. A different picture has emerged that highlights the importance of early inductive biases, including core concepts such as number, space, agency, and objects, as well as powerful learning algorithms that rely on prior knowledge to extract knowledge from small amounts of training data. This knowledge is often richly organized and theory-like in structure, capable of the graded inferences and productive capacities characteristic of human thought.
Here we present two challenge problems for machine learning and AI: learning simple visual concepts (Lake et al. 2015a) and learning to play the Atari game Frostbite (Mnih et al. 2015). We also use the problems as running examples to illustrate the importance of core cognitive ingredients in the sections that follow.
3.1. The Characters Challenge

The first challenge concerns handwritten character recognition, a classic problem for comparing different types of machine learning algorithms. Hofstadter (1985) argued that the problem of recognizing characters in all of the ways people do – both handwritten and printed – contains most, if not all, of the fundamental challenges of AI. Whether or not this statement is correct, it highlights the surprising complexity that underlies even “simple” human-level concepts like letters. More practically, handwritten character recognition is a real problem that children and adults must learn to solve, with practical applications ranging from reading addresses on envelopes to reading checks deposited at an automated teller machine (ATM). Handwritten character recognition is also simpler than more general forms of object recognition; the object of interest is two-dimensional, separated from the background, and usually unoccluded. Compared with how people learn and see other types of objects, it seems possible, in the near term, to build algorithms that can see most of the structure in characters that people can see.
The standard benchmark is the Modified National Institute of Standards and Technology (MNIST) data set for digit recognition, which involves classifying images of digits into the categories ‘0’ to ‘9’ (LeCun et al. 1998). The training set provides 6,000 images per class for a total of 60,000 training images. With a large amount of training data available, many algorithms achieve respectable performance, including K-nearest neighbors (5% test error), support vector machines (about 1% test error), and convolutional neural networks (below 1% test error [LeCun et al. 1998]). The best results achieved using deep convolutional nets are very close to human-level performance at an error rate of 0.2% (Ciresan et al. 2012). Similarly, recent results applying convolutional nets to the far more challenging ImageNet object recognition benchmark have shown that human-level performance is within reach on that data set as well (Russakovsky et al. 2015).
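For reference, the discriminative setup behind these error rates can be sketched in a few lines. The small ConvNet below (in PyTorch) is our own illustrative example, not the specific architecture of LeCun et al. (1998) or Ciresan et al. (2012); all layer sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# A small convolutional classifier for 28x28 grayscale MNIST digits.
# Layer sizes are illustrative; the cited models differ in detail.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 28x28 -> 14x14
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                      # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),                   # one logit per digit class
)

def train_step(images, labels, optimizer):
    # Standard discriminative training: minimize cross-entropy on a batch.
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Everything such a system knows about digits is whatever gradient descent deposits in these weights after tens of thousands of labeled examples; the contrast with how people learn characters is the subject of the next paragraphs.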
Although humans and neural networks may perform equally well on the MNIST digit recognition task and other large-scale image classification tasks, it does not mean that they learn and think in the same way. There are at least two important differences: people learn from fewer examples and they learn richer representations, a comparison true for both learning handwritten characters and for learning more general classes of objects (Fig. 1). People can learn to recognize a new handwritten character from a single example (Fig. 1A-i), allowing them to discriminate between novel instances drawn by other people and similar looking non-instances (Lake et al. 2015a; Miller et al. 2000). Moreover, people learn more than how to do pattern recognition: they learn a concept, that is, a model of the class that allows their acquired knowledge to be flexibly applied in new ways. In addition to recognizing new examples, people can also generate new examples (Fig. 1A-ii), parse a character into its most important parts and relations (Fig. 1A-iii) (Lake et al. 2012), and generate new characters given a small set of related characters (Fig. 1A-iv). These additional abilities come for free along with the acquisition of the underlying concept.
Even for these simple visual concepts, people are still better and more sophisticated learners than the best algorithms for character recognition. People learn a lot more from a lot less, and capturing these human-level learning abilities in machines is the Characters Challenge. We recently reported progress on this challenge using probabilistic program induction (Lake et al. 2015a) (see Glossary in Table 1), yet aspects of the full human cognitive ability remain out of reach. Although both people and models represent characters as a sequence of pen strokes and relations, people have a far richer repertoire of structural relations between strokes. Furthermore, people can efficiently integrate across multiple examples of a character to infer which have optional elements, such as the horizontal cross-bar in ‘7’s, combining different variants of the same character into a single coherent representation. Additional progress may come by combining deep learning and probabilistic program induction to tackle even richer versions of the Characters Challenge.
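To give a flavor of the generative, parts-based view behind this approach, here is a heavily simplified sketch in which a character concept is a small program that samples strokes and relations, and new examples are produced by rerunning the program with motor noise. This is only a cartoon of the model in Lake et al. (2015a); all function names and parameters below are hypothetical stand-ins.

```python
import random

def sample_character_type(stroke_library, max_strokes=4):
    # A character "type" is a small program: a set of strokes plus
    # relations describing how strokes attach to one another.
    n = random.randint(1, max_strokes)
    strokes = [random.choice(stroke_library) for _ in range(n)]
    relations = [random.choice(["independent", "attach_start", "attach_end"])
                 for _ in range(n)]
    return strokes, relations

def sample_token(character_type, motor_noise=0.05):
    # A "token" is one handwritten instance: rerun the program with noise
    # on the stroke trajectories, so every example differs slightly.
    strokes, _relations = character_type
    return [[(x + random.gauss(0, motor_noise),
              y + random.gauss(0, motor_noise)) for (x, y) in stroke]
            for stroke in strokes]
```

On this view, learning a new concept from one example amounts to inferring the program (strokes and relations) most likely to have generated it; classification, generation, and parsing then all reuse that same inferred program, which is why those abilities "come for free."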
3.2. The Frostbite Challenge

The second challenge concerns the Atari game Frostbite (Fig. 2), which was one of the control problems tackled by the DQN of Mnih et al. (2015). The DQN was a significant advance in reinforcement learning, showing that a single algorithm can learn to play a wide variety of complex tasks. The network was trained to play 49 classic Atari games, proposed as a test domain for reinforcement learning (Bellemare et al. 2013), impressively achieving human-level performance or above on 29 of the games. It did, however, have particular trouble with Frostbite and other games that required temporally extended planning strategies.
In Frostbite, players control an agent (Frostbite Bailey) tasked with constructing an igloo within a time limit. The igloo is built piece by piece as the agent jumps on ice floes in water (Fig. 2A–C). The challenge is that the ice floes are in constant motion (moving either left or right), and ice floes only contribute to the construction of the igloo if they are visited in an active state (white, rather than blue). The agent may also earn extra points by gathering fish while avoiding a number of fatal hazards (falling in the water, snow geese, polar bears, etc.). Success in this game requires a temporally extended plan to ensure the agent can accomplish a sub-goal (such as reaching an ice floe) and then safely proceed to the next sub-goal. Ultimately, once all of the pieces of the igloo are in place, the agent must proceed to the igloo and complete the level before time expires (Fig. 2C).
The DQN learns to play Frostbite and other Atari games by combining a powerful pattern recognizer (a deep convolutional neural network) and a simple model-free reinforcement learning algorithm (Q-learning [Watkins & Dayan 1992]). These components allow the network to map sensory inputs (frames of pixels) onto a policy over a small set of actions, and both the mapping and the policy are trained to optimize long-term cumulative reward (the game score). The network embodies the strongly empiricist approach characteristic of most connectionist models: very little is built into the network apart from the assumptions about image structure inherent in convolutional networks, so the network has to essentially learn a visual and conceptual system from scratch for each new game. In Mnih et al. (2015), the network architecture and hyper-parameters were fixed, but the network was trained anew for each game, meaning the visual system and the policy are highly specialized for the games it was trained on. More recent work has shown how these game-specific networks can share visual features (Rusu et al. 2016) or be used to train a multitask network (Parisotto et al. 2016), achieving modest benefits of transfer when learning to play new games.
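For readers less familiar with this setup, the sketch below shows the heart of deep Q-learning: a convolutional network maps a stack of recent frames to one value per action, and the Q-learning target drives the loss. The convolutional stack approximately follows the architecture reported by Mnih et al. (2015), but the training loop shown is simplified (the original used a clipped error, epsilon-greedy exploration, and other details elided here).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Frames in, one estimated action value out per action: everything the
    # agent knows about the game is distilled into this single mapping.
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),   # 4 stacked 84x84 frames
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):
        return self.net(frames)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # Q-learning target: reward plus the discounted value of the best next
    # action, estimated by a slowly updated copy of the network.
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    return nn.functional.mse_loss(q, target)
```

Note what is absent: there is no representation of objects, goals, or physics anywhere in this pipeline, only pixels, actions, and scalar rewards.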
Although it is interesting that the DQN learns to play games at human-level performance while assuming very little prior knowledge, the DQN may be learning to play Frostbite and other games in a very different way than people do. One way to examine the differences is by considering the amount of experience required for learning. In Mnih et al. (2015), the DQN was compared with a professional gamer who received approximately 2 hours of practice on each of the 49 Atari games (although he or she likely had prior experience with some of the games). The DQN was trained on 200 million frames from each of the games, which equates to approximately 924 hours of game time (about 38 days), or almost 500 times as much experience as the human received.2 Additionally, the DQN incorporates experience replay, where each of these frames is replayed approximately eight more times on average over the course of learning.
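For concreteness, the frames-to-hours conversion assumes the Atari 2600’s display rate of roughly 60 frames per second: 200,000,000 frames ÷ (60 frames/s × 3,600 s/hour) ≈ 925 hours, or about 38.5 days of continuous play, in line with the figures above (the exact number depends on the console’s precise frame rate).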
With the full 924 hours of unique experience and additional replay, the DQN achieved less than 10% of human-level performance during a controlled test session (see DQN in Fig. 3). More recent variants of the DQN perform better, and can even outperform the human tester (Schaul et al. 2016; Stadie et al. 2016; van Hasselt et al. 2016; Wang et al. 2016), reaching 83% of the professional gamer’s score by incorporating smarter experience replay (Schaul et al. 2016), and 172% by using smarter replay and more efficient parameter sharing (Wang et al. 2016) (see DQN+ and DQN++ in Fig. 3).3 But they require a lot of experience to reach this level. The learning curve for the model of Wang et al. (2016) shows performance is approximately 44% after 200 hours, 8% after 100 hours, and less than 2% after 5 hours (which is close to random play, approximately 1.5%). The differences between the human and machine learning curves suggest that they may be learning different kinds of knowledge, using different learning mechanisms, or both.
The contrast becomes even more dramatic if we look at the very earliest stages of learning. Although both the original DQN and these more recent variants require multiple hours of experience to perform reliably better than random play, even non-professional humans can grasp the basics of the game after just a few minutes of play. We speculate that people do this by inferring a general schema to describe the goals of the game and the object types and their interactions, using the kinds of intuitive theories, model-building abilities, and model-based planning mechanisms we describe below. Although novice players may make some mistakes, such as inferring that fish are harmful rather than helpful, they can learn to play better than chance within a few minutes. If humans are able to first watch an expert playing for a few minutes, they can learn even faster. In informal experiments, two of the authors played Frostbite on a Javascript emulator after watching videos of expert play on YouTube for just 2 minutes; we found that we were able to reach scores comparable to or better than the human expert reported in Mnih et al. (2015) after at most 15 to 20 minutes of total practice.
There are other behavioral signatures that suggest fundamental differences in representation and learning between people and the DQN. For example, the game of Frostbite provides incremental rewards for reaching each active ice floe, providing the DQN with the relevant sub-goals for completing the larger task of building an igloo. Without these sub-goals, the DQN would have to take random actions until it accidentally builds an igloo and is rewarded for completing the entire level. In contrast, people likely do not rely on incremental scoring in the same way when figuring out how to play a new game. In Frostbite, it is possible to figure out the higher-level goal of building an igloo without incremental feedback; similarly, sparse feedback is a source of difficulty in other Atari 2600 games such as Montezuma’s Revenge, in which people substantially outperform current DQN approaches.
The learned DQN network is also rather inflexible to changes in its inputs and goals. Changing the color or appearance of objects or changing the goals of the network would have devastating consequences on performance if the network is not retrained. Although any specific model is necessarily simplified and should not be held to the standard of general human intelligence, the contrast between DQN and human flexibility is striking nonetheless. For example, imagine you are tasked with playing Frostbite with any one of these new goals:
1. Get the lowest possible score.
2. Get closest to 100, or 300, or 1,000, or 3,000, or any level, without going over.
3. Beat your friend, who’s playing next to you, but just barely, not by too much, so as not to embarrass them.
4. Go as long as you can without dying.
5. Die as quickly as you can.
6. Pass each level at the last possible minute, right before the temperature timer hits zero and you die (i.e., come as close as you can to dying from frostbite without actually dying).
7. Get to the furthest unexplored level without regard for your score.
8. See if you can discover secret Easter eggs.
9. Get as many fish as you can.
10. Touch all of the individual ice floes on screen once and only once.
11. Teach your friend how to play as efficiently as possible.
This range of goals highlights an essential component of human intelligence: people can learn models and use them for arbitrary new tasks and goals. Although neural networks can learn multiple mappings or tasks with the same set of stimuli – adapting their outputs depending on a specified goal – these models require substantial training or reconfiguration to add new tasks (e.g., Collins & Frank 2013; Eliasmith et al. 2012; Rougier et al. 2005). In contrast, people require little or no retraining or reconfiguration, adding new tasks and goals to their repertoire with relative ease.
The contrast between DQN and human play on Frostbite is particularly telling. Even the best deep networks learn gradually over many thousands of game episodes, take a long time to reach good performance, and are locked into particular input and goal patterns. Humans, after playing just a small number of games over a span of minutes, can understand the game and its goals well enough to perform better than deep networks do after almost a thousand hours of experience. Even more impressively, people understand enough to invent or accept new goals, generalize over changes to the input, and explain the game to others. Why are people different? What core ingredients of human intelligence might the DQN and other modern machine learning methods be missing?
One might object that both the Frostbite and Characters challenges draw an unfair comparison between the speed of human learning and neural network learning. We discuss this objection in detail in Section 5, but we feel it is important to anticipate it here as well. To paraphrase one reviewer of an earlier draft of this article, “It is not that DQN and people are solving the same task differently. They may be better seen as solving different tasks. Human learners – unlike DQN and many other deep learning systems – approach new problems armed with extensive prior experience. The human is encountering one in a years-long string of problems, with rich overlapping structure. Humans as a result often have important domain-specific knowledge for these tasks, even before they ‘begin.’ The DQN is starting completely from scratch.”
We agree, and indeed this is another way of putting our point here. Human learners fundamentally take on different learning tasks than today’s neural networks, and if we want to build machines that learn and think like people, our machines need to confront the kinds of tasks that human learners do, not shy away from them. People never start completely from scratch, or even close to “from scratch,” and that is the secret to their success. The challenge of building models of human learning and thinking then becomes: How do we bring to bear rich prior knowledge to learn new tasks and solve new problems so quickly? What form does that prior knowledge take, and how is it constructed, from some combination of inbuilt capacities and previous experience? The core ingredients we propose in the next section offer one route to meeting this challenge.
4. Core ingredients of human intelligence

In the Introduction, we laid out what we see as core ingredients of intelligence. Here we consider the ingredients in detail and contrast them with the current state of neural network modeling. Although these are hardly the only ingredients needed for human-like learning and thought (see our discussion of language in sect. 5), they are key building blocks, which are not present in most current learning-based AI systems – certainly not all present together – and for which additional attention may prove especially fruitful. We believe that integrating them will produce significantly more powerful and more human-like learning and thinking abilities than we currently see in AI systems.
Before considering each ingredient in detail, it is important to clarify that by “core ingredient” we do not necessarily mean an ingredient that is innately specified by genetics or must be “built in” to any learning algorithm. We intend our discussion to be agnostic with regards to the origins of the key ingredients. By the time a child or an adult is picking up a new character or learning how to play Frostbite, he or she is armed with extensive real-world experience that deep learning systems do not benefit from – experience that would be hard to emulate in any general sense. Certainly, the core ingredients are enriched by this experience, and some may even be a product of the experience itself. Whether learned, built in, or enriched, the key claim is that these ingredients play an active and important role in producing human-like learning and thought, in ways contemporary machine learning has yet to capture.
4.1. Developmental start-up software

Early in development, humans have a foundational understanding of several core domains (Spelke 2003; Spelke & Kinzler 2007). These domains include number (numerical and set operations), space (geometry and navigation), physics (inanimate objects and mechanics), and psychology (agents and groups). These core domains cleave cognition at its conceptual joints, and each domain is organized by a set of entities and abstract principles relating the entities to each other. The underlying cognitive representations can be understood as “intuitive theories,” with a causal structure resembling a scientific theory (Carey 2004; 2009; Gopnik et al. 2004; Gopnik & Meltzoff 1999; Gweon et al. 2010; Schulz 2012b; Wellman & Gelman 1992; 1998). The “child as scientist” proposal further views the process of learning itself as also scientist-like, with recent experiments showing that children seek out new data to distinguish between hypotheses, isolate variables, test causal hypotheses, make use of the data-generating process in drawing conclusions, and learn selectively from others (Cook et al. 2011; Gweon et al. 2010; Schulz et al. 2007; Stahl & Feigenson 2015; Tsividis et al. 2013). We address the nature of learning mechanisms in Section 4.2.
Each core domain has been the target of a great deal of study and analysis, and together the domains are thought to be shared cross-culturally and partly with non-human animals. All of these domains may be important augmentations to current machine learning, though below, we focus in particular on the early understanding of objects and agents.
4.1.1. Intuitive physics. Young children have a rich knowledge of intuitive physics. Whether learned or innate, important physical concepts are present at ages far earlier than when a child or adult learns to play Frostbite, suggesting these resources may be used for solving this and many everyday physics-related tasks.
At the age of 2 months, and possibly earlier, human infants expect inanimate objects to follow principles of persistence, continuity, cohesion, and solidity. Young infants believe objects should move along smooth paths, not wink in and out of existence, not inter-penetrate, and not act at a distance (Spelke 1990; Spelke et al. 1995). These expectations guide object segmentation in early infancy, emerging before appearance-based cues such as color, texture, and perceptual goodness (Spelke 1990).
These expectations also go on to guide later learning. At around 6 months, infants have already developed different expectations for rigid bodies, soft bodies, and liquids (Rips & Hespos 2015). Liquids, for example, are expected to go through barriers, while solid objects cannot (Hespos et al. 2009). By their first birthday, infants have gone through several transitions of comprehending basic physical concepts such as inertia, support, containment, and collisions (Baillargeon 2004; Baillargeon et al. 2009; Hespos & Baillargeon 2008).
There is no single agreed-upon computational account of these early physical principles and concepts, and previous suggestions have ranged from decision trees (Baillargeon et al. 2009), to cues, to lists of rules (Siegler & Chen 1998). A promising recent approach sees intuitive physical reasoning as similar to inference over a physics software engine, the kind of simulator that powers modern-day animations and games (Bates et al. 2015; Battaglia et al. 2013; Gerstenberg et al. 2015; Sanborn et al. 2013). According to this hypothesis, people reconstruct a perceptual scene using internal representations of the objects and their physically relevant properties (such as mass, elasticity, and surface friction) and forces acting on objects (such as gravity, friction, or collision impulses). Relative to physical ground truth, the intuitive physical state representation is approximate and probabilistic, and oversimplified and incomplete in many ways. Still, it is rich enough to support mental simulations that can predict how objects will move in the immediate future, either on their own or in response to forces we might apply.
This “intuitive physics engine” approach enables flexible adaptation to a wide range of everyday scenarios and judgments in a way that goes beyond perceptual cues. For example (Fig. 4), a physics-engine reconstruction of a tower of wooden blocks from the game Jenga can be used to predict whether (and how) a tower will fall, finding close quantitative fits to how adults make these predictions (Battaglia et al. 2013), as well as simpler kinds of physical predictions that have been studied in infants (Téglás et al. 2011). Simulation-based models can also capture how people make hypothetical or counterfactual predictions: What would happen if certain blocks were taken away, more blocks were added, or the table supporting the tower was jostled? What if certain blocks were glued together, or attached to the table surface? What if the blocks were made of different materials (Styrofoam, lead, ice)? What if the blocks of one color were much heavier than those of other colors? Each of these physical judgments may require new features or new training for a pattern recognition account to work at the same level as the model-based simulator.
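A minimal sketch of this “approximate probabilistic simulation” idea, assuming some black-box rigid-body simulator is available (the `simulate_tower` call below is a hypothetical stand-in, not an API from the cited work; the noise level and fall criterion are likewise illustrative assumptions):

```python
import random

def prob_tower_falls(block_positions, simulate_tower, n_samples=100,
                     position_noise=0.2):
    # Perception is uncertain, so jitter the inferred block positions,
    # run the physics engine forward from each sample, and report the
    # fraction of rollouts in which the blocks end up displaced.
    falls = 0
    for _ in range(n_samples):
        noisy = [(x + random.gauss(0, position_noise),
                  y + random.gauss(0, position_noise),
                  z)                               # jitter horizontal placement
                 for (x, y, z) in block_positions]
        final = simulate_tower(noisy)              # hypothetical engine call
        if mean_displacement(noisy, final) > 1.0:  # did the tower move much?
            falls += 1
    return falls / n_samples

def mean_displacement(before, after):
    # Average distance the blocks traveled over the simulated rollout.
    return sum(((xb - xa) ** 2 + (yb - ya) ** 2 + (zb - za) ** 2) ** 0.5
               for (xb, yb, zb), (xa, ya, za) in zip(before, after)) / len(before)
```

The same machinery answers the counterfactual questions above simply by editing the scene before simulating (remove a block, change a mass, glue two blocks together), which is precisely the flexibility a task-specific pattern recognizer lacks.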
What are the prospects for embedding or acquiring this kind of intuitive physics in deep learning systems? Connectionist models in psychology have previously been applied to physical reasoning tasks such as balance-beam rules (McClelland 1988; Shultz 2003) or rules relating to distance, velocity, and time in motion (Buckingham & Shultz 2000). However, these networks do not attempt to work with complex scenes as input, or a wide range of scenarios and judgments as in Figure 4. A recent paper from Facebook AI researchers (Lerer et al. 2016) represents an exciting step in this direction. Lerer et al. (2016) trained a deep convolutional network-based system (PhysNet) to predict the stability of block towers from simulated images similar to those in Figure 4A, but with much simpler configurations of two, three, or four cubical blocks stacked vertically. Impressively, PhysNet generalized to simple real images of block towers, matching human performance on these images, meanwhile exceeding human performance on synthetic images. Human and PhysNet confidence were also correlated across towers, although not as strongly as for the approximate probabilistic simulation models and experiments of Battaglia et al. (2013). One limitation is that PhysNet currently requires extensive training – between 100,000 and 200,000 scenes – to learn judgments for just a single task (will the tower fall?) on a narrow range of scenes (towers with two to four cubes). It has been shown to generalize, but also only in limited ways (e.g., from towers of two and three cubes to towers of four cubes). In contrast, people require far less experience to perform any particular task, and can generalize to many novel judgments and complex scenes with no new training required (although they receive large amounts of physics experience through interacting with the world more generally). Could deep learning systems such as PhysNet capture this flexibility, without explicitly simulating the causal interactions between objects in three dimensions? We are not sure, but we hope this is a challenge they will take on.
Alternatively, instead of trying to make predictions without simulating physics, could neural networks be trained to emulate a general-purpose physics simulator, given the right type and quantity of training data, such as the raw input experienced by a child? This is an active and intriguing area of research, but it too faces significant challenges. For networks trained on object classification, deeper layers often become sensitive to successively higher-level features, from edges to textures to shape-parts to full objects (Yosinski et al. 2014; Zeiler & Fergus 2014). For deep networks trained on physics-related data, it remains to be seen whether higher layers will encode objects, general physical properties, forces, and approximately Newtonian dynamics. A generic network trained on dynamic pixel data might learn an implicit representation of these concepts, but would it generalize broadly beyond training contexts as people’s more explicit physical concepts do? Consider, for example, a network that learns to predict the trajectories of several balls bouncing in a box (Kodratoff & Michalski 2014). If this network has actually learned something like Newtonian mechanics, then it should be able to generalize to interestingly different scenarios – at a minimum different numbers of differently shaped objects, bouncing in boxes of different shapes and sizes and orientations with respect to gravity, not to mention more severe generalization tests such as all of the tower tasks discussed above, which also fall under the Newtonian domain. Neural network researchers have yet to take on this challenge, but we hope they will. Whether such models can be learned with the kind (and quantity) of data available to human infants is not clear, as we discuss further in Section 5.
It may be difficult to integrate object and physics-based primitives into deep neural networks, but the payoff in terms of learning speed and performance could be great for many tasks. Consider the case of learning to play Frostbite. Although it can be difficult to discern exactly how a network learns to solve a particular task, the DQN probably does not parse a Frostbite screenshot in terms of stable objects or sprites moving according to the rules of intuitive physics (Fig. 2). But incorporating a physics-engine-based representation could help DQNs learn to play games such as Frostbite in a faster and more general way, whether the physics knowledge is captured implicitly in a neural network or more explicitly in a simulator. Beyond reducing the amount of training data, and potentially improving the level of performance reached by the DQN, it could eliminate the need to retrain a Frostbite network if the objects (e.g., birds, ice floes, and fish) are slightly altered in their behavior, reward structure, or appearance. When a new object type such as a bear is introduced, as in the later levels of Frostbite (Fig. 2D), a network endowed with intuitive physics would also have an easier time adding this object type to its knowledge (the challenge of adding new objects was also discussed in Marcus [1998; 2001]). In this way, the integration of intuitive physics and deep learning could be an important step toward more human-like learning algorithms.
4.1.2. Intuitive psychology. Intuitive psychology is another early-emerging ability with an important influence on human learning and thought. Pre-verbal infants distinguish animate agents from inanimate objects. This distinction is partially based on innate or early-present detectors for low-level cues, such as the presence of eyes, motion initiated from rest, and biological motion (Johnson et al. 1998; Premack & Premack 1997; Schlottmann et al. 2006; Tremoulet & Feldman 2000). Such cues are often sufficient but not necessary for the detection of agency.
Beyond these low-level cues, infants also expect agents to act contingently and reciprocally, to have goals, and to take efficient actions toward those goals subject to constraints (Csibra 2008; Csibra et al. 2003; Spelke & Kinzler 2007). These goals can be socially directed; at around 3 months of age, infants begin to discriminate antisocial agents that hurt or hinder others from neutral agents (Hamlin 2013; Hamlin et al. 2010), and they later distinguish between anti-social, neutral, and pro-social agents (Hamlin et al. 2007; 2013).
It is generally agreed that infants expect agents to act in a goal-directed, efficient, and socially sensitive fashion (Spelke & Kinzler 2007). What is less agreed on is the computational architecture that supports this reasoning and whether it includes any reference to mental states and explicit goals.
One possibility is that intuitive psychology is simply cues “all the way down” (Schlottmann et al. 2013; Scholl & Gao 2013), though this would require more and more cues as the scenarios become more complex. Consider, for example, a scenario in which an agent A is moving toward a box, and an agent B moves in a way that blocks A from reaching the box. Infants and adults are likely to interpret B’s behavior as “hindering” (Hamlin 2013). This inference could be captured by a cue that states, “If an agent’s expected trajectory is prevented from completion, the blocking agent is given some negative association.” Although the cue is easily calculated, the scenario is also easily changed to necessitate a different type of cue. Suppose A was already negatively associated (a “bad guy”); acting negatively toward A could then be seen as good (Hamlin 2013). Or suppose something harmful was in the box, which A did not know about. Now B would be seen as helping, protecting, or defending A. Suppose A knew there was something bad in the box and wanted it anyway. B could be seen as acting paternalistically. A cue-based account would be twisted into gnarled combinations such as, “If an expected trajectory is prevented from completion, the blocking agent is given some negative association, unless that trajectory leads to a negative outcome or the blocking agent is previously associated as positive, or the blocked agent is previously associated as negative, or….”
One alternative to a cue-based account is to use generative models of action choice, as in the Bayesian inverse planning, or Bayesian theory of mind (ToM), models of Baker et al. (2009) or the naive utility calculus models of Jara-Ettinger et al. (2015) (see also Jern and Kemp [2015] and Tauber and Steyvers [2011], and a related alternative based on predictive coding from Kilner et al. [2007]). These models formalize explicitly mentalistic concepts such as "goal," "agent," "planning," "cost," "efficiency," and "belief," used to describe core psychological reasoning in infancy. They assume adults and children treat agents as approximately rational planners who choose the most efficient means to their goals. Planning computations may be formalized as solutions to Markov decision processes (MDPs) or partially observable Markov decision processes (POMDPs), taking as input utility and belief functions defined over an agent's state-space and the agent's state-action transition functions, and returning a series of actions the agent should perform to most efficiently fulfill their goals (or maximize their utility). By simulating these planning processes, people can predict what agents might do next, or use inverse reasoning from observing a series of actions to infer the utilities and beliefs of agents in a scene. This is directly analogous to how simulation engines can be used for intuitive physics, to predict what will happen next in a scene or to infer objects' dynamical properties from how they move. It yields similarly flexible reasoning abilities: Utilities and beliefs can be adjusted to take into account how agents might act for a wide range of novel goals and situations. Importantly, unlike in intuitive physics, simulation-based reasoning in intuitive psychology can be nested recursively to understand social interactions. We can think about agents thinking about other agents.
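To make inverse planning concrete, the sketch below gives a minimal version in Python. The one-dimensional world, the candidate goal set, and the softmax rationality parameter are all illustrative assumptions rather than details of the Baker et al. (2009) model; the point is only the shape of the computation: a forward planner turned into a likelihood, inverted by Bayes' rule.

```python
import numpy as np

# Toy Bayesian inverse planning: infer an agent's goal from observed moves.
# States are positions on a line; actions are -1 (left) or +1 (right).
GOALS = [0, 9]   # hypothetical candidate goal locations
BETA = 2.0       # softmax rationality: higher = closer to perfectly efficient

def action_values(state, goal):
    """Forward planning (here trivial): value = -distance to goal after acting."""
    return np.array([-abs((state + a) - goal) for a in (-1, +1)])

def action_likelihood(state, action, goal):
    """P(action | state, goal) for an approximately rational (softmax) agent."""
    q = action_values(state, goal)
    p = np.exp(BETA * q) / np.exp(BETA * q).sum()
    return p[0] if action == -1 else p[1]

def goal_posterior(trajectory):
    """Invert the planner with Bayes' rule: P(goal | observed state-action pairs)."""
    posterior = np.ones(len(GOALS)) / len(GOALS)   # uniform prior over goals
    for state, action in trajectory:
        posterior *= [action_likelihood(state, action, g) for g in GOALS]
        posterior /= posterior.sum()
    return dict(zip(GOALS, posterior))

# Watching the agent step right twice from position 5 favors the goal at 9.
print(goal_posterior([(5, +1), (6, +1)]))
```

Nesting this computation, with agents simulating other agents who are themselves running the same inference, gives the recursive social reasoning just described; richer versions replace the one-step planner with MDP or POMDP solutions.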
As in the case of intuitive physics, the success that generic deep networks will have in capturing intuitive psychological reasoning will depend in part on the representations humans use. Although deep networks have not yet been applied to scenarios involving theory of mind and intuitive psychology, they could probably learn visual cues, heuristics, and summary statistics of a scene that happens to involve agents.5 If that is all that underlies human psychological reasoning, a data-driven deep learning approach can likely find success in this domain. However, it seems to us that any full formal account of intuitive psychological reasoning needs to include representations of agency, goals, efficiency, and reciprocal relations. As with objects and forces, it is unclear whether a complete representation of these concepts (agents, goals, etc.) could emerge from deep neural networks trained in a purely predictive capacity. Similar to the intuitive physics domain, it is possible that with a tremendous number of training trajectories in a variety of scenarios, deep learning techniques could approximate the reasoning found in infancy even without learning anything about goal-directed or socially directed behavior more generally. But this is also unlikely to resemble how humans learn, understand, and apply intuitive psychology unless the concepts are genuine. In the same way that altering the setting of a scene or the target of inference in a physics-related task may be difficult to generalize without an understanding of objects, altering the setting of an agent or their goals and beliefs is difficult to reason about without understanding intuitive psychology.
In introducing the Frostbite challenge, we discussed how people can learn to play the game extremely quickly by watching an experienced player for just a few minutes and then playing a few rounds themselves. Intuitive psychology provides a basis for efficient learning from others, especially in teaching settings with the goal of communicating knowledge efficiently (Shafto et al. 2014). In the case of watching an expert play Frostbite, whether or not there is an explicit goal to teach, intuitive psychology lets us infer the beliefs, desires, and intentions of the experienced player. For example, we can learn that the birds are to be avoided from seeing how the experienced player appears to avoid them. We do not need to experience a single example of encountering a bird, and watching Frostbite Bailey die because of the bird, to infer that birds are probably dangerous. It is enough to see that the experienced player's avoidance behavior is best explained as acting under that belief.
Similarly, consider how a sidekick agent (increasingly popular in video games) is expected to help a player achieve his or her goals. This agent can be useful in different ways in different circumstances, such as getting items, clearing paths, fighting, defending, healing, and providing information, all under the general notion of being helpful (Macindoe 2013). An explicit agent representation can predict how such an agent will be helpful in new circumstances, whereas a bottom-up pixel-based representation is likely to struggle.
There are several ways that intuitive psychology could be incorporated into contemporary deep learning systems. Although it could be built in, intuitive psychology may arise in other ways. Connectionists have argued that innate constraints in the form of hard-wired cortical circuits are unlikely (Elman 2005; Elman et al. 1996), but a simple inductive bias, for example, the tendency to notice things that move other things, can bootstrap reasoning about more abstract concepts of agency (Ullman et al. 2012a).6 Similarly, a great deal of goal-directed and socially directed actions can also be boiled down to a simple utility calculus (e.g., Jara-Ettinger et al. 2015), in a way that could be shared with other cognitive abilities. Although the origins of intuitive psychology are still a matter of debate, it is clear that these abilities are early emerging and play an important role in human learning and thought, as exemplified in the Frostbite challenge and when learning to play novel video games more broadly.
4.2. Learning as rapid model building

Since their inception, neural network models have stressed the importance of learning. There are many learning algorithms for neural networks, including the perceptron algorithm (Rosenblatt 1958), Hebbian learning (Hebb 1949), the BCM rule (Bienenstock et al. 1982), backpropagation (Rumelhart et al. 1986a), the wake-sleep algorithm (Hinton et al. 1995), and contrastive divergence (Hinton 2002). Whether the goal is supervised or unsupervised learning, these algorithms implement learning as a process of gradual adjustment of connection strengths. For supervised learning, the updates are usually aimed at improving the algorithm's pattern recognition capabilities. For unsupervised learning, the updates work toward gradually matching the statistics of the model's internal patterns with the statistics of the input data.
In recent years, machine learning has found particular success using backpropagation and large data sets to solve difficult pattern recognition problems (see Glossary in Table 1). Although these algorithms have reached human-level performance on several challenging benchmarks, they are still far from matching human-level learning in other ways. Deep neural networks often need more data than people do to solve the same types of problems, whether it is learning to recognize a new type of object or learning to play a new game. When learning the meanings of words in their native language, children make meaningful generalizations from very sparse data (Carey & Bartlett 1978; Landau et al. 1988; Markman 1989; Smith et al. 2002; Xu & Tenenbaum 2007; although see Horst & Samuelson 2008 regarding memory limitations). Children may only need to see a few examples of the concepts hairbrush, pineapple, and lightsaber before they largely "get it," grasping the boundary of the infinite set that defines each concept from the infinite set of all possible objects. Children are far more practiced than adults at learning new concepts, learning roughly 9 or 10 new words each day from the time they begin speaking through the end of high school (Bloom 2000; Carey 1978). Yet the ability for rapid "one-shot" learning does not disappear in adulthood. An adult may need only a single image or movie of a novel two-wheeled vehicle to infer the boundary between this concept and others, allowing him or her to discriminate new examples of that concept from similar-looking objects of a different type (Fig. 1B-i).
Contrasting with the efficiency of human learning, neural networks, by virtue of their generality as highly flexible function approximators, are notoriously data hungry (the bias/variance dilemma [Geman et al. 1992]). Benchmark tasks such as the ImageNet data set for object recognition provide hundreds or thousands of examples per class (Krizhevsky et al. 2012; Russakovsky et al. 2015): 1,000 hairbrushes, 1,000 pineapples, and so on. In the context of learning new handwritten characters or learning to play Frostbite, the MNIST benchmark includes 6,000 examples of each handwritten digit (LeCun et al. 1998), and the DQN of Mnih et al. (2015) played each Atari video game for approximately 924 hours of unique training experience (Fig. 3). In both cases, the algorithms are clearly using information less efficiently than a person learning to perform the same tasks.
It is also important to mention that there are many classes of concepts that people learn more slowly. Concepts that are learned in school are usually far more challenging and more difficult to acquire, including mathematical functions, logarithms, derivatives, integrals, atoms, electrons, gravity, DNA, and evolution. There are also domains for which machine learners outperform human learners, such as combing through financial or weather data. But for the vast majority of cognitively natural concepts – the types of things that children learn as the meanings of words – people are still far better learners than machines. This is the type of learning we focus on in this section, which is more suitable for the enterprise of reverse engineering and articulating additional principles that make human learning successful. It also opens the possibility of building these ingredients into the next generation of machine learning and AI algorithms, with potential for making progress on learning concepts that are both easy and difficult for humans to acquire.
Even with just a few examples, people can learn remarkably rich conceptual models. One indicator of richness is the variety of functions that these models support (Markman & Ross 2003; Solomon et al. 1999). Beyond classification, concepts support prediction (Murphy & Ross 1994; Rips 1975), action (Barsalou 1983), communication (Markman & Makin 1998), imagination (Jern & Kemp 2013; Ward 1994), explanation (Lombrozo 2009; Williams & Lombrozo 2010), and composition (Murphy 1988; Osherson & Smith 1981). These abilities are not independent; rather, they hang together and interact (Solomon et al. 1999), coming for free with the acquisition of the underlying concept. Returning to the previous example of a novel two-wheeled vehicle, a person can sketch a range of new instances (Fig. 1B-ii), parse the concept into its most important components (Fig. 1B-iii), or even create a new complex concept through the combination of familiar concepts (Fig. 1B-iv). Likewise, as discussed in the context of Frostbite, a learner who has acquired the basics of the game could flexibly apply his or her knowledge to an infinite set of Frostbite variants (sect. 3.2). The acquired knowledge supports reconfiguration to new tasks and new demands, such as modifying the goals of the game to survive while acquiring as few points as possible, or to efficiently teach the rules to a friend.
This richness and flexibility suggest that learning as model building is a better metaphor than learning as pattern recognition. Furthermore, the human capacity for one-shot learning suggests that these models are built upon rich domain knowledge rather than starting from a blank slate (Mikolov et al. 2016; Mitchell et al. 1986). In contrast, much of the recent progress in deep learning has been on pattern recognition problems, including object recognition, speech recognition, and (model-free) video game learning, that use large data sets and little domain knowledge. There has been recent work on other types of tasks, including learning generative models of images (Denton et al. 2015; Gregor et al. 2015), caption generation (Karpathy & Fei-Fei 2017; Vinyals et al. 2014; Xu et al. 2015), question answering (Sukhbaatar et al. 2015; Weston et al. 2015b), and learning simple algorithms (Graves et al. 2014; Grefenstette et al. 2015). We discuss question answering and learning simple algorithms in Section 6.1. Yet, at least for image and caption generation, these tasks have been mostly studied in the big data setting that is at odds with the impressive human ability to generalize from small data sets (although see Rezende et al. [2016] for a deep learning approach to the Character Challenge). And it has been difficult to learn neural network–style representations that effortlessly generalize to new tasks that they were not trained on (see Davis & Marcus 2015; Marcus 1998; 2001). What additional ingredients may be needed to rapidly learn more powerful and more general-purpose representations?
A relevant case study is from our own work on the Characters Challenge (sect. 3.1; Lake 2014; Lake et al. 2015a). People and various machine learning approaches were compared on their ability to learn new handwritten characters from the world's alphabets. In addition to evaluating several types of deep learning models, we developed an algorithm using Bayesian program learning (BPL) that represents concepts as simple stochastic programs: structured procedures that generate new examples of a concept when executed (Fig. 5A). These programs allow the model to express causal knowledge about how the raw data are formed, and the probabilistic semantics allow the model to handle noise and perform creative tasks. Structure sharing across concepts is accomplished by the compositional re-use of stochastic primitives that can combine in new ways to create new concepts.
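As a rough illustration of what "concepts as stochastic programs" means, the sketch below treats a character type as a program that, when executed, emits noisy tokens. The primitive library, the noise model, and all parameters are invented stand-ins, far simpler than the actual BPL model, but they show the two levels of generation at work.

```python
import random

# Hypothetical library of stroke primitives (shared across concepts):
# each primitive is a short sequence of pen displacements.
PRIMITIVES = {
    "arc":  [(1, 2), (2, 1)],
    "line": [(0, 3)],
    "hook": [(2, 0), (0, -1)],
}

def sample_character_type():
    """Higher-level program: sample a new character type (itself a program)."""
    n_strokes = random.randint(1, 3)
    return [random.choice(list(PRIMITIVES)) for _ in range(n_strokes)]

def sample_token(char_type, motor_noise=0.3):
    """Lower-level program: execute a type to produce one noisy token."""
    pen, token = (0.0, 0.0), []
    for stroke in char_type:
        for dx, dy in PRIMITIVES[stroke]:
            pen = (pen[0] + dx + random.gauss(0, motor_noise),
                   pen[1] + dy + random.gauss(0, motor_noise))
            token.append(pen)
    return token

concept = sample_character_type()                      # a new concept is a program
examples = [sample_token(concept) for _ in range(3)]   # tokens are its executions
```

In this framing, one-shot classification becomes program induction: search for the stroke program most likely to have produced the observed image, then ask which inferred program best explains a new example.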
Note that we are overloading the word model to refer to the BPL framework as a whole (which is a generative model), as well as the individual probabilistic models (or concepts) that it infers from images to represent novel handwritten characters. There is a hierarchy of models: a higher-level program that generates different types of concepts, which are themselves programs that can be run to generate tokens of a concept. Here, describing learning as "rapid model building" refers to the fact that BPL constructs generative models (lower-level programs) that produce tokens of a concept (Fig. 5B).
Learning models of this form allows BPL to perform a challenging one-shot classification task at human-level performance (Fig. 1A-i) and to outperform current deep learning models such as convolutional networks (Koch et al. 2015).7 The representations that BPL learns also enable it to generalize in other, more creative, human-like ways, as evaluated using "visual Turing tests" (e.g., Fig. 5B). These tasks include generating new examples (Figs. 1A-ii and 5B), parsing objects into their essential components (Fig. 1A-iii), and generating new concepts in the style of a particular alphabet (Fig. 1A-iv). The following sections discuss the three main ingredients – compositionality, causality, and learning-to-learn – that were important to the success of this framework and, we believe, are important to understanding human learning as rapid model building more broadly. Although these ingredients fit naturally within a BPL or a probabilistic program induction framework, they could also be integrated into deep learning models and other types of machine learning algorithms, prospects we discuss in more detail below.
4.2.1. Compositionality. Compositionality is the classic idea that new representations can be constructed through the combination of primitive elements. In computer programming, primitive functions can be combined to create new functions, and these new functions can be further combined to create even more complex functions. This function hierarchy provides an efficient description of higher-level functions, such as a hierarchy of parts for describing complex objects or scenes (Bienenstock et al. 1997). Compositionality is also at the core of productivity: an infinite number of representations can be constructed from a finite set of primitives, just as the mind can think an infinite number of thoughts, utter or understand an infinite number of sentences, or learn new concepts from a seemingly infinite space of possibilities (Fodor 1975; Fodor & Pylyshyn 1988; Marcus 2001; Piantadosi 2011).
Compositionality has been broadly influential in both AI and cognitive science, especially as it pertains to theories of object recognition, conceptual representation, and language. Here, we focus on compositional representations of object concepts for illustration. Structural description models represent visual concepts as compositions of parts and relations, which provides a strong inductive bias for constructing models of new concepts (Biederman 1987; Hummel & Biederman 1992; Marr & Nishihara 1978; van den Hengel et al. 2015; Winston 1975). For instance, the novel two-wheeled vehicle in Figure 1B might be represented as two wheels connected by a platform, which provides the base for a post, which holds the handlebars, and so on. Parts can themselves be composed of sub-parts, forming a "partonomy" of part-whole relationships (Miller & Johnson-Laird 1976; Tversky & Hemenway 1984). In the novel vehicle example, the parts and relations can be shared and re-used from existing related concepts, such as cars, scooters, motorcycles, and unicycles.
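A structural description of this kind is naturally written down as a typed graph of parts and relations. The sketch below is one hypothetical encoding of the novel vehicle; the part names and relation vocabulary are illustrative choices, not a committed representation from the literature.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    name: str                                     # e.g., "wheel", re-usable across concepts
    subparts: list = field(default_factory=list)  # the "partonomy"

@dataclass
class Concept:
    parts: dict        # role label -> Part
    relations: list    # (relation, role_a, role_b) triples

# A part carried over from previously learned concepts (cars, scooters, ...).
wheel = Part("wheel", subparts=[Part("rim"), Part("tire")])

novel_vehicle = Concept(
    parts={"front_wheel": wheel, "back_wheel": wheel,
           "platform": Part("platform"), "post": Part("post"),
           "handlebars": Part("handlebars")},
    relations=[("connects", "front_wheel", "platform"),
               ("connects", "back_wheel", "platform"),
               ("supports", "platform", "post"),
               ("holds", "post", "handlebars")],
)
```

Because the wheel part is shared by reference, anything learned about wheels elsewhere is available here at no extra cost.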
Because the parts and relations are themselves a product of previous learning, their facilitation of the construction of new models is also an example of learning-to-learn, another ingredient that is covered below. Although compositionality and learning-to-learn fit naturally together, there are also forms of compositionality that rely less on previous learning, such as the bottom-up, parts-based representation of Hoffman and Richards (1984).
Learning models of novel handwritten characters can be operationalized in a similar way. Handwritten characters are inherently compositional, where the parts are pen strokes, and relations describe how these strokes connect to each other. Lake et al. (2015a) modeled these parts using an additional layer of compositionality, where parts are complex movements created from simpler sub-part movements. New characters can be constructed by combining parts, sub-parts, and relations in novel ways (Fig. 5). Compositionality is also central to the construction of other types of symbolic concepts beyond characters, where new spoken words can be created through a novel combination of phonemes (Lake et al. 2014), or a new gesture or dance move can be created through a combination of more primitive body movements.
An efficient representation for Frostbite should be similarly compositional and productive. A scene from the game is a composition of various object types, including birds, fish, ice floes, igloos, and so on (Fig. 2). Representing this compositional structure explicitly is both more economical and better for generalization, as noted in previous work on object-oriented reinforcement learning (Diuk et al. 2008). Many repetitions of the same objects are present at different locations in the scene, and therefore, representing each as an identical instance of the same object with the same properties is important for efficient representation and quick learning of the game. Further, new levels may contain different numbers and combinations of objects, where a compositional representation of objects – using intuitive physics and intuitive psychology as glue – would aid in making these crucial generalizations (Fig. 2D).
Deep neural networks have at least a limited notion of compositionality. Networks trained for object recognition encode part-like features in their deeper layers (Zeiler & Fergus 2014), whereby the presentation of new types of objects can activate novel combinations of feature detectors. Similarly, a DQN trained to play Frostbite may learn to represent multiple replications of the same object with the same features, facilitated by the invariance properties of a convolutional neural network architecture. Recent work has shown how this type of compositionality can be made more explicit, where neural networks can be used for efficient inference in more structured generative models (both neural networks and three-dimensional scene models) that explicitly represent the number of objects in a scene (Eslami et al. 2016). Beyond the compositionality inherent in parts, objects, and scenes, compositionality can also be important at the level of goals and sub-goals. Recent work on hierarchical DQNs shows that by providing explicit object representations to a DQN, and then defining sub-goals based on reaching those objects, DQNs can learn to play games with sparse rewards (such as Montezuma's Revenge) by combining these sub-goals together to achieve larger goals (Kulkarni et al. 2016).
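The control flow of such a hierarchical agent can be captured in a small, self-contained toy. In the sketch below, a corridor world gives extrinsic reward only at its far end, a meta-controller picks which sub-goal position to pursue, and a controller is trained on intrinsic reward for reaching the chosen sub-goal. The world, sub-goal set, and learning constants are hypothetical simplifications of the Kulkarni et al. (2016) setup, not a reimplementation of it.

```python
import random

# Sparse-reward corridor: states 0..9, extrinsic reward only at state 9.
N, GOAL, SUBGOALS = 10, 9, [3, 6, 9]
ALPHA, GAMMA, EPS = 0.3, 0.9, 0.2

q_meta = {(s, g): 0.0 for s in range(N) for g in SUBGOALS}
q_ctrl = {(s, g, a): 0.0 for s in range(N) for g in SUBGOALS for a in (-1, 1)}

def choose(table, keys):
    """Epsilon-greedy choice over a table of values."""
    if random.random() < EPS:
        return random.choice(keys)
    return max(keys, key=lambda k: table[k])

for episode in range(300):
    s = 0
    while s != GOAL:
        # Meta-controller: pick a sub-goal (an object/location to reach).
        g = choose(q_meta, [(s, sg) for sg in SUBGOALS])[1]
        s0, extrinsic = s, 0.0
        while s != g and s != GOAL:
            # Controller: act toward the sub-goal for intrinsic reward.
            a = choose(q_ctrl, [(s, g, -1), (s, g, 1)])[2]
            s2 = min(max(s + a, 0), N - 1)
            extrinsic += 1.0 if s2 == GOAL else 0.0
            intrinsic = 1.0 if s2 == g else 0.0
            best = max(q_ctrl[(s2, g, -1)], q_ctrl[(s2, g, 1)])
            q_ctrl[(s, g, a)] += ALPHA * (intrinsic + GAMMA * best - q_ctrl[(s, g, a)])
            s = s2
        # Meta-controller learns from the game's sparse extrinsic reward.
        best_meta = max(q_meta[(s, sg)] for sg in SUBGOALS)
        q_meta[(s0, g)] += ALPHA * (extrinsic + GAMMA * best_meta - q_meta[(s0, g)])
```

The sub-goals act as a compositional decomposition of the task: the same controller skills ("reach object X") recombine under different meta-level plans.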
We look forward to seeing these new ideas continue to develop, potentially providing even richer notions of compositionality in deep neural networks that lead to faster and more flexible learning. To capture the full extent of the mind's compositionality, a model must include explicit representations of objects, identity, and relations, all while maintaining a notion of "coherence" when understanding novel configurations. Coherence is related to our next principle, causality, which is discussed in the section that follows.
4.2.2. Causality. In concept learning and scene understanding, causal models represent hypothetical real-world processes that produce the perceptual observations. In control and reinforcement learning, causal models represent the structure of the environment, such as modeling state-to-state transitions or action/state-to-state transitions.
Concept learning and vision models that use causality are usually generative (as opposed to discriminative; see Glossary in Table 1), but not every generative model is also causal. Although a generative model describes a process for generating data, or at least assigns a probability distribution over possible data points, this generative process may not resemble how the data are produced in the real world. Causality refers to the subclass of generative models that resemble, at an abstract level, how the data are actually generated. Although generative neural networks such as Deep Belief Networks (Hinton et al. 2006) or variational auto-encoders (Gregor et al. 2016; Kingma et al. 2014) may generate compelling handwritten digits, they mark one end of the "causality spectrum," because the steps of the generative process bear little resemblance to steps in the actual process of writing. In contrast, the generative model for characters using BPL does resemble the steps of writing, although even more causally faithful models are possible.
Causality has been influential in theories of perception. "Analysis-by-synthesis" theories of perception maintain that sensory data can be more richly represented by modeling the process that generated them (Bever & Poeppel 2010; Eden 1962; Halle & Stevens 1962; Neisser 1966). Relating data to their causal source provides strong priors for perception and learning, as well as a richer basis for generalizing in new ways and to new tasks. The canonical examples of this approach are speech and visual perception. For example, Liberman et al. (1967) argued that the richness of speech perception is best explained by inverting the production plan, at the level of vocal tract movements, to explain the large amounts of acoustic variability and the blending of cues across adjacent phonemes. As discussed, causality does not have to be a literal inversion of the actual generative mechanisms, as proposed in the motor theory of speech. For the BPL model of learning handwritten characters, causality is operationalized by treating concepts as motor programs, or abstract causal descriptions of how to produce examples of the concept, rather than concrete configurations of specific muscles (Fig. 5A). Causality is an important factor in the model's success in classifying and generating new examples after seeing just a single example of a new concept (Lake et al. 2015a) (Fig. 5B).
Causal knowledge has also been shown to influence how people learn new concepts; providing a learner with different types of causal knowledge changes how he or she learns and generalizes. For example, the structure of the causal network underlying the features of a category influences how people categorize new examples (Rehder 2003; Rehder & Hastie 2001). Similarly, as related to the Characters Challenge, the way people learn to write a novel handwritten character influences later perception and categorization (Freyd 1983; 1987).
To explain the role of causality in learning, conceptual representations have been likened to intuitive theories or explanations, providing the glue that lets core features stick, whereas other equally applicable features wash away (Murphy & Medin 1985). Borrowing examples from Murphy and Medin (1985), the feature "flammable" is more closely attached to wood than money because of the underlying causal roles of the concepts, even though the feature is equally applicable to both. These causal roles derive from the functions of objects. Causality can also glue some features together by relating them to a deeper underlying cause, explaining why some features such as "can fly," "has wings," and "has feathers" co-occur across objects, whereas others do not.
Beyond concept learning, people also understand scenes by building causal models. Human-level scene understanding involves composing a story that explains the perceptual observations, drawing upon and integrating the ingredients of intuitive physics, intuitive psychology, and compositionality. Perception without these ingredients, and absent the causal glue that binds them, can lead to revealing errors. Consider image captions generated by a deep neural network (Fig. 6) (Karpathy & Fei-Fei 2017). In many cases, the network gets the key objects in a scene correct but fails to understand the physical forces at work, the mental states of the people, or the causal relationships between the objects. In other words, it does not build the right causal model of the data.
There have been steps toward deep neural networks and related approaches that learn causal models. Lopez-Paz et al. (2015) introduced a discriminative, data-driven framework for distinguishing the direction of causality from examples. Although it outperforms existing methods on various causal prediction tasks, it is unclear how to apply the approach to inferring rich hierarchies of latent causal variables, as needed for the Frostbite Challenge and especially the Characters Challenge. Graves (2014) learned a generative model of cursive handwriting using a recurrent neural network trained on handwriting data. Although it synthesizes impressive examples of handwriting in various styles, it requires a large training corpus and has not been applied to other tasks. The DRAW network performs both recognition and generation of handwritten digits using recurrent neural networks with a window of attention, producing a limited circular area of the image at each time step (Gregor et al. 2015). A more recent variant of DRAW was applied to generating examples of a novel character from just a single training example (Rezende et al. 2016). The model demonstrates an impressive ability to make plausible generalizations that go beyond the training examples, yet it generalizes too broadly in other cases, in ways that are not especially human-like. It is not clear that it could yet pass any of the "visual Turing tests" in Lake et al. (2015a) (Fig. 5B), although we hope DRAW-style networks will continue to be extended and enriched, and could be made to pass these tests.
Incorporating causality may greatly improve these deep learning models; they were trained without access to causal data about how characters are actually produced, and without any incentive to learn the true causal process. An attentional window is only a crude approximation of the true causal process of drawing with a pen, and in Rezende et al. (2016) the attentional window is not pen-like at all, although a more accurate pen model could be incorporated. We anticipate that these sequential generative neural networks could make sharper one-shot inferences, with the goal of tackling the full Characters Challenge by incorporating additional causal, compositional, and hierarchical structure (and by continuing to use learning-to-learn, described next), potentially leading to a more computationally efficient and neurally grounded variant of the BPL model of handwritten characters (Fig. 5).
A causal model of Frostbite would have to be more complex, gluing together object representations and explaining their interactions with intuitive physics and intuitive psychology, much like the game engine that generates the game dynamics and, ultimately, the frames of pixel images. Inference is the process of inverting this causal generative model, explaining the raw pixels as objects and their interactions, such as the agent stepping on an ice floe to deactivate it or a crab pushing the agent into the water (Fig. 2). Deep neural networks could play a role in two ways: by serving as a bottom-up proposer to make probabilistic inference more tractable in a structured generative model (sect. 4.3.1) or by serving as the causal generative model if imbued with the right set of ingredients.
4.2.3. Learning-to-learn. When humans or machines make inferences that go far beyond the data, strong prior knowledge (or inductive biases or constraints) must be making up the difference (Geman et al. 1992; Griffiths et al. 2010; Tenenbaum et al. 2011). One way people acquire this prior knowledge is through "learning-to-learn," a term introduced by Harlow (1949) and closely related to the machine learning notions of "transfer learning," "multitask learning," and "representation learning." These terms refer to ways that learning a new task or a new concept can be accelerated through previous or parallel learning of other related tasks or other related concepts. The strong priors, constraints, or inductive bias needed to learn a particular task quickly are often shared to some extent with other related tasks. A range of mechanisms have been developed to adapt the learner's inductive bias as they learn specific tasks and then apply these inductive biases to new tasks.
In hierarchical Bayesian modeling (Gelman et al. 2004), a general prior on concepts is shared by multiple specific concepts, and the prior itself is learned over the course of learning the specific concepts (Salakhutdinov et al. 2012; 2013). These models have been used to explain the dynamics of human learning-to-learn in many areas of cognition, including word learning, causal learning, and learning intuitive theories of physical and social domains (Tenenbaum et al. 2011). In machine vision, for deep convolutional networks or other discriminative methods that form the core of recent recognition systems, learning-to-learn can occur through the sharing of features between the models learned for old objects or old tasks and the models learned for new objects or new tasks (Anselmi et al. 2016; Baxter 2000; Bottou 2014; Lopez-Paz et al. 2016; Rusu et al. 2016; Salakhutdinov et al. 2011; Srivastava & Salakhutdinov 2013; Torralba et al. 2007; Zeiler & Fergus 2014). Neural networks can also learn-to-learn by optimizing hyper-parameters, including the form of their weight update rule (Andrychowicz et al. 2016), over a set of related tasks.
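A toy version of the hierarchical Bayesian story fits in a few lines. Below, each concept is a Gaussian over some feature, the shared prior over concept means is estimated from many background concepts, and a brand-new concept is then inferred from a single example via the standard conjugate update. All distributions and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background concepts: each true mean is drawn from a shared (unknown) prior.
TRUE_PRIOR_MU, TRUE_PRIOR_SD, OBS_SD = 5.0, 2.0, 1.0
old_means = rng.normal(TRUE_PRIOR_MU, TRUE_PRIOR_SD, size=50)
old_data = [rng.normal(m, OBS_SD, size=20) for m in old_means]

# Learning-to-learn: estimate the shared prior from the old concepts.
est = np.array([d.mean() for d in old_data])
prior_mu, prior_sd = est.mean(), est.std()

# One-shot learning: combine the learned prior with a single observation.
x = 8.3
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / OBS_SD**2)
post_mu = post_var * (prior_mu / prior_sd**2 + x / OBS_SD**2)
print(f"new concept mean ~ N({post_mu:.2f}, {post_var:.2f})")
```

The learned prior does the heavy lifting: with one example, the posterior leans on what previous concepts were like, which is the sense in which learning the prior accelerates learning each new concept.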
Although transfer learning and multitask learning are already important themes across AI, and in deep learning in particular, they have not yet led to systems that learn new tasks as rapidly and flexibly as humans do. Capturing more human-like learning-to-learn dynamics in deep networks and other machine learning approaches could facilitate much stronger transfer to new tasks and new problems. To gain the full benefit that humans get from learning-to-learn, however, AI systems might first need to adopt the more compositional (or more language-like, see sect. 5) and causal forms of representations that we have argued for above.
We can see this potential in both of our challenge problems. In the Characters Challenge as presented in Lake et al. (2015a), all viable models use "pre-training" on many character concepts in a background set of alphabets to tune the representations they use to learn new character concepts in a test set of alphabets. But to perform well, current neural network approaches require much more pre-training than do people or our Bayesian program learning approach. Humans typically learn only one or a few alphabets, and even with related drawing experience, this likely amounts to the equivalent of a few hundred character-like visual concepts at most. For BPL, pre-training with characters in only five alphabets (for around 150 character types in total) is sufficient to perform human-level one-shot classification and generation of new examples. With this level of pre-training, current neural networks perform much worse on classification and have not even attempted generation; they are still far from solving the Characters Challenge.8
We cannot be sure how people get to the knowledge they have in this domain, but we do understand how this works in BPL, and we think people might be similar. BPL transfers readily to new concepts because it learns about object parts, sub-parts, and relations, capturing learning about what each concept is like and what concepts are like in general. It is crucial that learning-to-learn occurs at multiple levels of the hierarchical generative process. Previously learned primitive actions and larger generative pieces can be re-used and re-combined to define new generative models for new characters (Fig. 5A). Further transfer occurs by learning about the typical levels of variability within a typical generative model. This provides knowledge about how far and in what ways to generalize when we have seen only one example of a new character, which on its own could not possibly carry any information about variance. BPL could also benefit from deeper forms of learning-to-learn than it currently does. Some of the important structure it exploits to generalize well is built into the prior and not learned from the background pre-training, whereas people might learn this knowledge, and ultimately, a human-like machine learning system should as well.
Analogous learning-to-learn occurs for humans in learning many new object models, in vision and cognition: Consider the novel two-wheeled vehicle in Figure 1B, where learning-to-learn can operate through the transfer of previously learned parts and relations (sub-concepts such as wheels, motors, handlebars, attached, powered by) that reconfigure compositionally to create a model of the new concept. If deep neural networks could adopt similarly compositional, hierarchical, and causal representations, we expect they could benefit more from learning-to-learn.
In the Frostbite Challenge, and in video games more generally, there is a similar interdependence between the form of the representation and the effectiveness of learning-to-learn. People seem to transfer knowledge at multiple levels, from low-level perception to high-level strategy, exploiting compositionality at all levels. Most basically, they immediately parse the game environment into objects, types of objects, and causal relations between them. People also understand that video games like these have goals, which often involve approaching or avoiding objects based on their type. Whether the person is a child or a seasoned gamer, it seems obvious that interacting with the birds and fish will change the game state in some way, either good or bad, because video games typically yield costs or rewards for these types of interactions (e.g., dying or points). These types of hypotheses can be quite specific and rely on prior knowledge: When the polar bear first appears and tracks the agent's location during advanced levels (Fig. 2D), an attentive learner is sure to avoid it. Depending on the level, ice floes can be spaced far apart (Fig. 2A–C) or close together (Fig. 2D), suggesting the agent may be able to cross some gaps, but not others. In this way, general world knowledge and previous video games may help inform exploration and generalization in new scenarios, helping people learn maximally from a single mistake or avoid mistakes altogether.
Deep reinforcement learning systems for playing Atari games have had some impressive successes in transfer learning, but they still have not come close to learning to play new games as quickly as humans can. For example, Parisotto et al. (2016) present the "actor-mimic" algorithm that first learns 13 Atari games by watching an expert network play and trying to mimic the expert network's action selection and/or internal states (for about 4 million frames of experience each, or 18.5 hours per game). This algorithm can then learn new games faster than a randomly initialized DQN: Scores that might have taken 4 or 5 million frames of learning to reach might now be reached after 1 or 2 million frames of practice. But anecdotally, we find that humans can still reach these scores with a few minutes of practice, requiring far less experience than the DQNs.
In sum, the interaction between representation and previous experience may be key to building machines that learn as fast as people. A deep learning system trained on many video games may not, by itself, be enough to learn new games as quickly as people. Yet, if such a system aims to learn compositionally structured causal models of each game – built on a foundation of intuitive physics and psychology – it could transfer knowledge more efficiently and thereby learn new games much more quickly.
4.3. Thinking fast

The previous section focused on learning rich models from sparse data and proposed ingredients for achieving these human-like learning abilities. These cognitive abilities are even more striking when considering the speed of perception and thought: the amount of time required to understand a scene, think a thought, or choose an action. In general, richer and more structured models require more complex and slower inference algorithms, similar to how complex models require more data, making the speed of perception and thought all the more remarkable.
The combination of rich models with efficient inference suggests another way psychology and neuroscience may usefully inform AI. It also suggests an additional way to build on the successes of deep learning, where efficient inference and scalable learning are important strengths of the approach. This section discusses possible paths toward resolving the conflict between fast inference and structured representations, including Helmholtz machine–style approximate inference in generative models (Dayan et al. 1995; Hinton et al. 1995) and cooperation between model-free and model-based reinforcement learning systems.
4.3.1. Approximate inference in structured models. Hierarchical Bayesian models operating over probabilistic programs (Goodman et al. 2008; Lake et al. 2015a; Tenenbaum et al. 2011) are equipped to deal with theory-like structures and rich causal representations of the world, yet there are formidable algorithmic challenges for efficient inference. Computing a probability distribution over an entire space of programs is usually intractable, and often even finding a single high-probability program poses an intractable search problem. In contrast, whereas representing intuitive theories and structured causal models is less natural in deep neural networks, recent progress has demonstrated the remarkable effectiveness of gradient-based learning in high-dimensional parameter spaces. A complete account of learning and inference must explain how the brain does so much with limited computational resources (Gershman et al. 2015; Vul et al. 2014).
Popular algorithms for approximate inference in probabilistic machine learning have been proposed as psychological models (see Griffiths et al. [2012] for a review). Most prominently, it has been proposed that humans can approximate Bayesian inference using Monte Carlo methods, which stochastically sample the space of possible hypotheses and evaluate these samples according to their consistency with the data and prior knowledge (Bonawitz et al. 2014; Gershman et al. 2012; Ullman et al. 2012b; Vul et al. 2014). Monte Carlo sampling has been invoked to explain behavioral phenomena ranging from children's response variability (Bonawitz et al. 2014), to garden-path effects in sentence processing (Levy et al. 2009), and perceptual multistability (Gershman et al. 2012; Moreno-Bote et al. 2011). Moreover, we are beginning to understand how such methods could be implemented in neural circuits (Buesing et al. 2011; Huang & Rao 2014; Pecevski et al. 2011).9
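For concreteness, the sketch below shows the basic move that these Monte Carlo accounts share: propose a small change to the current hypothesis and accept it stochastically according to how well it explains the data. The coin-weight hypothesis space and data are toy stand-ins, not a model from this literature.

```python
import math, random

DATA = [1, 1, 0, 1, 1, 1, 0, 1]   # toy observations

def log_posterior(theta):
    """Log-score of hypothesis theta against the data (uniform prior)."""
    if not 0.01 <= theta <= 0.99:
        return -math.inf
    return sum(math.log(theta if x else 1 - theta) for x in DATA)

def metropolis(n_steps=5000, step=0.1):
    """Metropolis-Hastings: a random walk over hypotheses, biased toward
    those more consistent with the data and prior knowledge."""
    theta, samples = 0.5, []
    for _ in range(n_steps):
        proposal = theta + random.gauss(0, step)       # small local change
        log_accept = log_posterior(proposal) - log_posterior(theta)
        if random.random() < math.exp(min(0.0, log_accept)):
            theta = proposal                           # stochastic acceptance
        samples.append(theta)
    return samples

samples = metropolis()
print(sum(samples) / len(samples))   # posterior mean, near 6/8 = 0.75
```

Swapping the real-valued coin weight for structured hypotheses such as programs or theories leaves the loop unchanged; only the proposal and scoring functions differ, which is where the difficulty discussed next arises.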
Although Monte Carlo methods are powerful and come with asymptotic guarantees, it is challenging to make them work on complex problems like program induction and theory learning. When the hypothesis space is vast, and only a few hypotheses are consistent with the data, how can good models be discovered without exhaustive search? In at least some domains, people may not have an especially clever solution to this problem, instead grappling with the full combinatorial complexity of theory learning (Ullman et al. 2012b). Discovering new theories can be slow and arduous, as testified by the long time scale of cognitive development, and learning in a saltatory fashion (rather than through gradual adaptation) is characteristic of aspects of human intelligence, including discovery and insight during development (Schulz 2012b), problem-solving (Sternberg & Davidson 1995), and epoch-making discoveries in scientific research (Langley et al. 1987). Discovering new theories can also occur much more quickly. A person learning the rules of Frostbite will probably undergo a loosely ordered sequence of "Aha!" moments: He or she will learn that jumping on ice floes causes them to change color, that changing the color of ice floes causes an igloo to be constructed piece-by-piece, that birds make him or her lose points, that fish make him or her gain points, that he or she can change the direction of ice floes at the cost of one igloo piece, and so on. These little fragments of a "Frostbite theory" are assembled to form a causal understanding of the game relatively quickly, in what seems more like a guided process than arbitrary proposals in a Monte Carlo inference scheme. Similarly, as described in the Characters Challenge, people can quickly infer motor programs to draw a new character in a similarly guided process.
For domains where program or theory learning occurs quickly, it is possible that people employ inductive biases not only to evaluate hypotheses, but also to guide hypothesis selection. Schulz (2012b) has suggested that abstract structural properties of problems contain information about the abstract forms of their solutions. Even without knowing the answer to the question, "Where is the deepest point in the Pacific Ocean?" one still knows that the answer must be a location on a map. The answer "20 inches" to the question, "What year was Lincoln born?" can be invalidated a priori, even without knowing the correct answer. In recent experiments, Tsividis et al. (2015) found that children can use high-level abstract features of a domain to guide hypothesis selection, by reasoning about distributional properties like the ratio of seeds to flowers, and dynamical properties like periodic or monotonic relationships between causes and effects (see also Magid et al. 2015).
How might efficient mappings from questions to a plausible subset of answers be learned? Recent work in AI, spanning both deep learning and graphical models, has attempted to tackle this challenge by "amortizing" probabilistic inference computations into an efficient feed-forward mapping (Eslami et al. 2014; Heess et al. 2013; Mnih & Gregor 2014; Stuhlmüller et al. 2013). We can also think of this as "learning to do inference," which is independent from the ideas of learning as model building discussed in the previous section. These feed-forward mappings can be learned in various ways, for example, using paired generative/recognition networks (Dayan et al. 1995; Hinton et al. 1995) and variational optimization (Gregor et al. 2015; Mnih & Gregor 2014; Rezende et al. 2014), or nearest-neighbor density estimation (Kulkarni et al. 2015a; Stuhlmüller et al. 2013). One implication of amortization is that solutions to different problems will become correlated because of the sharing of amortized computations. Some evidence for inferential correlations in humans was reported by Gershman and Goodman (2014). This trend is an avenue of potential integration of deep learning models with probabilistic models and probabilistic programming: training neural networks to help perform probabilistic inference in a generative model or a probabilistic program (Eslami et al. 2016; Kulkarni et al. 2015b; Yildirim et al. 2015). Another avenue for potential integration is through differentiable programming (Dalrymple 2016), by ensuring that the program-like hypotheses are differentiable and thus learnable via gradient descent – a possibility discussed in the concluding section (Section 6.1).
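The amortization trick itself is simple to state in code: because the generative model can produce unlimited (latent, data) pairs, a feed-forward mapping can be trained on those pairs to send data straight back to latents. The sketch below uses a linear-Gaussian toy model and a closed-form least-squares fit as the "recognition model" so that it stays self-contained; in practice, the mapping would be a neural network trained with the variational methods cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

def generative_model(n):
    """Toy causal model: latent z produces observation x = 2z + 1 + noise."""
    z = rng.normal(0, 1, size=n)
    x = 2 * z + 1 + rng.normal(0, 0.1, size=n)
    return z, x

# Amortization: fit a cheap feed-forward mapping x -> z using samples
# drawn from the generative model itself (no external data required).
z_train, x_train = generative_model(10_000)
X = np.stack([x_train, np.ones_like(x_train)], axis=1)
w, *_ = np.linalg.lstsq(X, z_train, rcond=None)   # fit z ~ w[0]*x + w[1]

def recognize(x_new):
    """Fast amortized inference: one feed-forward pass, no per-query search."""
    return w[0] * x_new + w[1]

print(recognize(5.0))   # approximately inverts the model: z near (5 - 1) / 2
```

The one-time training cost is repaid across every subsequent query, and because many queries are answered by the same mapping, their errors become correlated, matching the behavioral signature noted above.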
4.3.2. Model-based and model-free reinforcement learning. The DQN introduced by Mnih et al. (2015) used a simple form of model-free reinforcement learning in a deep neural network that allows for fast selection of actions. There is indeed substantial evidence that the brain uses similar model-free learning algorithms in simple associative learning or discrimination learning tasks (see Niv 2009, for a review). In particular, the phasic firing of midbrain dopaminergic neurons is qualitatively (Schultz et al. 1997) and quantitatively (Bayer & Glimcher 2005) consistent with the reward prediction error that drives updating of model-free value estimates.
Model-free learning is not, however, the whole story. Considerable evidence suggests that the brain also has a model-based learning system, responsible for building a "cognitive map" of the environment and using it to plan action sequences for more complex tasks (Daw et al. 2005; Dolan & Dayan 2013). Model-based planning is an essential ingredient of human intelligence, enabling flexible adaptation to new tasks and goals; it is where all of the rich model-building abilities discussed in the previous sections earn their value as guides to action. As we argued in our discussion of Frostbite, one can design numerous variants of this simple video game that are identical except for the reward function; that is, governed by an identical environment model of state-action–dependent transitions. We conjecture that a competent Frostbite player can easily shift behavior appropriately, with little or no additional learning, and it is hard to imagine a way of doing that other than having a model-based planning approach in which the environment model can be modularly combined with arbitrary new reward functions and then deployed immediately for planning. One boundary condition on this flexibility is the fact that skills become "habitized" with routine application, possibly reflecting a shift from model-based to model-free control. This shift may arise from a rational arbitration between learning systems to balance the trade-off between flexibility and speed (Daw et al. 2005; Keramati et al. 2011).
Similarly to how probabilistic computations can be amortized for efficiency (see previous section), plans can be amortized into cached values by allowing the model-based system to simulate training data for the model-free system (Sutton 1990). This process might occur offline (e.g., in dreaming or quiet wakefulness), suggesting a form of consolidation in reinforcement learning (Gershman et al. 2014). Consistent with the idea of cooperation between learning systems, a recent experiment demonstrated that model-based behavior becomes automatic over the course of training (Economides et al. 2015). Thus, a marriage of flexibility and efficiency might be achievable if we use the human reinforcement learning systems as guidance.
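This cooperation is essentially the Dyna architecture of Sutton (1990), and a minimal sketch makes the division of labor explicit: real experience updates both a model-free value cache and a learned model of the environment, and the model is then replayed "offline" to refresh the cache. The world and constants here are toy assumptions.

```python
import random

ALPHA, GAMMA, N_PLANNING = 0.5, 0.95, 20
ACTIONS = (-1, 1)

Q = {}       # model-free cache: (state, action) -> value
model = {}   # learned environment model: (state, action) -> (reward, next state)

def q(s, a):
    return Q.get((s, a), 0.0)

def backup(s, a, r, s2):
    """One Q-learning update, shared by real and simulated experience."""
    best_next = max(q(s2, a2) for a2 in ACTIONS)
    Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * best_next - q(s, a))

def dyna_step(s, a, r, s2):
    backup(s, a, r, s2)           # model-free learning from real experience
    model[(s, a)] = (r, s2)       # model-based learning: remember the outcome
    for _ in range(N_PLANNING):   # "dreaming": replay the model offline,
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps2)   # amortizing plans into cached values

# One step of real experience buys twenty simulated ones.
dyna_step(s=0, a=1, r=0.0, s2=1)
```

Because the cached values can always be regenerated from the model, a change in goals calls for re-planning rather than relearning the environment, the flexibility argued for above.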
Intrinsic motivation also plays an important role in human learning and behavior (Berlyne 1966; Harlow 1950; Ryan & Deci 2007). Although much of the previous discussion assumes the standard view of behavior as seeking to maximize reward and minimize punishment, all externally provided rewards are reinterpreted according to the "internal value" of the agent, which may depend on the current goal and mental state. There may also be an intrinsic drive to reduce uncertainty and construct models of the environment (Edelman 2015; Schmidhuber 2015), closely related to learning-to-learn and multitask learning. Deep reinforcement learning is only just starting to address intrinsically motivated learning (Kulkarni et al. 2016; Mohamed & Rezende 2015).
5. Responses to common questions

In discussing the arguments in this article with colleagues, three lines of questioning or critiques have frequently arisen. We think it is helpful to address these points directly, to maximize the potential for moving forward together.
5.1. Comparing the learning speeds of humans and neural networks on specific tasks is not meaningful, because humans have extensive prior experience

It may seem unfair to compare neural networks and humans on the amount of training experience required to perform a task, such as learning to play new Atari games or learning new handwritten characters, when humans have had extensive prior experience that these networks have not benefited from. People have had many hours playing other games, and experience reading or writing many other handwritten characters, not to mention experience in a variety of more loosely related tasks. If neural networks were "pre-trained" on the same experience, the argument goes, then they might generalize similarly to humans when exposed to novel tasks.
This has been the rationale behind multitask learning or transfer learning, a strategy with a long history that has shown some promising results recently with deep networks (e.g., Donahue et al. 2014; Luong et al. 2015; Parisotto et al. 2016). Furthermore, some deep learning advocates argue the human brain effectively benefits from even more experience through evolution. If deep learning researchers see themselves as trying to capture the equivalent of humans' collective evolutionary experience, this would be equivalent to a truly immense "pre-training" phase.
We agree that humans have a much richer starting point than neural networks when learning most new tasks, including learning a new concept or learning to play a new video game. That is the point of the "developmental start-up software" and other building blocks that we argued are key to creating this richer starting point. We are less committed to a particular story regarding the origins of the ingredients, including the relative roles of genetically programmed and experience-driven developmental mechanisms in building these components in early infancy. Either way, we see them as fundamental building blocks for facilitating rapid learning from sparse data.
Learning-to-learn across multiple tasks is conceivably one route to acquiring these ingredients, but simply training conventional neural networks on many related tasks may not be sufficient to generalize in human-like ways for novel tasks. As we argued in Section 4.2.3, successful learning-to-learn – or, at least, human-level transfer learning – is enabled by having models with the right representational structure, including the other building blocks discussed in this article. Learning-to-learn is a powerful ingredient, but it can be more powerful when operating over compositional representations that capture the underlying causal structure of the environment, while also building on intuitive physics and psychology.
Finally, we recognize that some researchers still hold out hope that if only they can get big enough training data sets, sufficiently rich tasks, and enough computing power – far beyond what has been tried out so far – then deep learning methods might be sufficient to learn representations equivalent to what evolution and learning provide humans. We can sympathize with that hope, and believe it deserves further exploration, although we are not sure it is a realistic one. We understand in principle how evolution could build a brain with the cognitive ingredients we discuss here. Stochastic hill climbing is slow – it may require massively parallel exploration, over millions of years with innumerable dead ends – but it can build complex structures with complex functions if we are willing to wait long enough. In contrast, trying to build these representations from scratch using backpropagation, Deep Q-learning, or any stochastic gradient-descent weight update rule in a fixed network architecture may be unfeasible regardless of how much training data are available. To build these representations from scratch might require exploring fundamental structural variations in the network's architecture, which gradient-based learning in weight space is not prepared to do. Although deep learning researchers do explore many such architectural variations, and have been devising increasingly clever and powerful ones recently, it is the researchers who are driving and directing this process. Exploration and creative innovation in the space of network architectures have not yet been made algorithmic. Perhaps they could be, using genetic programming methods (Koza 1992) or other structure-search algorithms (Yamins et al. 2014). We think this would be a fascinating and promising direction to explore, but we may have to acquire more patience than machine-learning researchers typically express with their algorithms: the dynamics of structure search may look much more like the slow random hill climbing of evolution than the smooth, methodical progress of stochastic gradient descent. An alternative strategy is to build in appropriate infant-like knowledge representations and core ingredients as the starting point for our learning-based AI systems, or to build learning systems with strong inductive biases that guide them in this direction.
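As a sketch of what making structure search algorithmic might look like, consider stochastic hill climbing over architecture descriptions rather than over weights. Everything here is a toy stand-in; in particular, `fitness` abbreviates the expensive step of actually training and evaluating each candidate network.

```python
import random

random.seed(0)

def fitness(arch):
    """Stand-in for 'train this architecture and measure validation
    accuracy.' A toy score favoring moderate depth and width."""
    depth_penalty = abs(len(arch) - 4)
    width_penalty = sum(abs(w - 64) for w in arch) / 100.0
    return -(depth_penalty + width_penalty)

def mutate(arch):
    """Structural variation: add, drop, or resize a hidden layer."""
    arch = list(arch)
    op = random.choice(["add", "drop", "resize"])
    if op == "add":
        arch.insert(random.randrange(len(arch) + 1),
                    random.choice([16, 32, 64, 128]))
    elif op == "drop" and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))
    else:
        i = random.randrange(len(arch))
        arch[i] = max(8, arch[i] + random.choice([-32, -16, 16, 32]))
    return arch

# Stochastic hill climbing over architectures, not weights.
best = [32, 32]                # initial architecture: two hidden layers
best_score = fitness(best)
for step in range(1000):
    cand = mutate(best)
    score = fitness(cand)
    if score > best_score:     # greedy accept; real search is far slower
        best, best_score = cand, score
print(best, best_score)
```

Even in this toy form, the dynamics are the ones noted above: progress comes from occasionally accepting a lucky structural mutation, not from following a gradient.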
Regardless of which way an AI developer chooses to go, our main points are orthogonal to this objection. There is a set of core cognitive ingredients for human-like learning and thought. Deep learning models could incorporate these ingredients through some combination of additional structure and perhaps additional learning mechanisms, but for the most part have yet to do so. Any approach to human-like AI, whether based on deep learning or not, is likely to gain from incorporating these ingredients.
5.2. Biological plausibility suggests theories of intelligence should start with neural networks

We have focused on how cognitive science can motivate and guide efforts to engineer human-like AI, in contrast to some advocates of deep neural networks who cite neuroscience for inspiration. Our approach is guided by a pragmatic view that the clearest path to a computational formalization of human intelligence comes from understanding the "software" before the "hardware." In the case of this article, we proposed key ingredients of this software in previous sections.
Nonetheless, a cognitive approach to intelligence should not ignore what we know about the brain. Neuroscience can provide valuable inspiration for both cognitive models and AI researchers: the centrality of neural networks and model-free reinforcement learning in our proposals for "thinking fast" (sect. 4.3) is a prime exemplar. Neuroscience can also, in principle, impose constraints on cognitive accounts, at both the cellular and systems levels. If deep learning embodies brain-like computational mechanisms and those mechanisms are incompatible with some cognitive theory, then this is an argument against that cognitive theory and in favor of deep learning. Unfortunately, what we "know" about the brain is not all that clear-cut. Many seemingly well-accepted ideas regarding neural computation are in fact biologically dubious, or uncertain at best, and therefore should not disqualify cognitive ingredients that pose challenges for implementation within that approach.
For example, most neural networks use some form of gradient-based (e.g., backpropagation) or Hebbian learning. It has long been argued, however, that backpropagation is not biologically plausible. As Crick (1989) famously pointed out, backpropagation seems to require that information be transmitted backward along the axon, which does not fit with realistic models of neuronal function (although recent models circumvent this problem in various ways [Liao et al. 2015; Lillicrap et al. 2014; Scellier & Bengio 2016]). This has not prevented backpropagation from being put to good use in connectionist models of cognition or in building deep neural networks for AI. Neural network researchers must regard it as a very good thing, in this case, that concerns of biological plausibility did not hold back research on this particular algorithmic approach to learning.10 We strongly agree: Although neuroscientists have not found any mechanisms for implementing backpropagation in the brain, neither have they produced definitive evidence against it. The existing data simply offer little constraint either way, and backpropagation has been of great value in engineering today's best pattern recognition systems.
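As one illustration of how such concerns can be circumvented, the random-feedback scheme of Lillicrap et al. (2014) replaces the backward pass through the transposed forward weights with a fixed random matrix. A minimal sketch of the idea (our simplification, with random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem with random stand-in data.
X = rng.normal(size=(200, 20))
Y = rng.normal(size=(200, 5))

W1 = rng.normal(scale=0.1, size=(20, 50))
W2 = rng.normal(scale=0.1, size=(50, 5))
B = rng.normal(scale=0.1, size=(5, 50))   # fixed random feedback weights

lr = 0.01
for _ in range(500):
    H = np.tanh(X @ W1)
    E = H @ W2 - Y                 # output error
    # Exact backpropagation would send the error back through W2.T;
    # feedback alignment routes it through the fixed random matrix B,
    # avoiding the biologically questionable "weight transport."
    dH = (E @ B) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ E) / len(X)
    W1 -= lr * (X.T @ dH) / len(X)
```

Although the feedback matrix B is never learned, the forward weights tend to align with it over training, which is why this biologically less objectionable scheme can still approximate backpropagation.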
Hebbian learning is another case in point. In the form of long-term potentiation (LTP) and spike-timing-dependent plasticity (STDP), Hebbian learning mechanisms are often cited as biologically supported (Bi & Poo 2001). However, the cognitive significance of any biologically grounded form of Hebbian learning is unclear. Gallistel and Matzel (2013) have persuasively argued that the critical interstimulus interval for LTP is orders of magnitude smaller than the intervals that are behaviorally relevant in most forms of learning. In fact, experiments that simultaneously manipulate the interstimulus and intertrial intervals demonstrate that no critical interval exists. Behavior can persist for weeks or months, whereas LTP decays to baseline over the course of days (Power et al. 1997). Learned behavior is rapidly re-acquired after extinction (Bouton 2004), whereas no such facilitation is observed for LTP (de Jonge & Racine 1985). Most relevantly for our focus, it would be especially challenging to try to implement the ingredients described in this article using purely Hebbian mechanisms.
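To make the locality at issue explicit, a canonical Hebbian update (the textbook form, not a formula from the papers cited above) adjusts a synapse using only the activity of the two neurons it connects,

$$\Delta w_{ij} = \eta \, x_i \, y_j,$$

where $x_i$ is presynaptic activity, $y_j$ is postsynaptic activity, and $\eta$ is a learning rate. Because the update sees nothing beyond this local pair, it is difficult to see how such a rule alone could assemble the structured, model-building ingredients discussed in this article.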
Claims of biological plausibility or implausibility usually rest on rather stylized assumptions about the brain that are wrong in many of their details. Furthermore, these claims usually pertain to the cellular and synaptic levels, with few connections made to systems-level neuroscience and subcortical brain organization (Edelman 2015). Understanding which details matter and which do not requires a computational theory (Marr 1982). Moreover, in the absence of strong constraints from neuroscience, we can turn the biological argument around: Perhaps a hypothetical biological mechanism should be viewed with skepticism if it is cognitively implausible. In the long run, we are optimistic that neuroscience will eventually place more constraints on theories of intelligence. For now, we believe cognitive plausibility offers a surer foundation.
5.3. Language is essential for human intelligence. Why is it not more prominent here?

We have said little in this article about people's ability to communicate and think in natural language, a distinctively human cognitive capacity where machine capabilities strikingly lag. Certainly one could argue that language should be included on any short list of key ingredients in human intelligence: For example, Mikolov et al. (2016) featured language prominently in their recent paper sketching challenge problems and a road map for AI. Moreover, whereas natural language processing is an active area of research in deep learning (e.g., Bahdanau et al. 2015; Mikolov et al. 2013; Xu et al. 2015), it is widely recognized that neural networks are far from implementing human language abilities. The question is, how do we develop machines with a richer capacity for language?
We believe that understanding language and its role in intelligence goes hand-in-hand with understanding the building blocks discussed in this article. It is also true that language builds on the core abilities for intuitive physics, intuitive psychology, and rapid learning with compositional, causal models that we focus on. These capacities are in place before children master language, and they provide the building blocks for linguistic meaning and language acquisition (Carey 2009; Jackendoff 2003; Kemp 2007; O'Donnell 2015; Pinker 2007; Xu & Tenenbaum 2007). We hope that by better understanding these earlier ingredients and how to implement and integrate them computationally, we will be better positioned to understand linguistic meaning and acquisition in computational terms and to explore other ingredients that make human language possible.
What else might we need to add to these core ingredients to get language? Many researchers have speculated about key features of human cognition that give rise to language and other uniquely human modes of thought: Is it recursion, or some new kind of recursive structure-building ability (Berwick & Chomsky 2016; Hauser et al. 2002)? Is it the ability to re-use symbols by name (Deacon 1998)? Is it the ability to understand others intentionally and build shared intentionality (Bloom 2000; Frank et al. 2009; Tomasello 2010)? Is it some new version of these things, or is it just more of the aspects of these capacities that are already present in infants? These are important questions for future work with the potential to expand the list of key ingredients; we did not intend our list to be complete.
Finally, we should keep in mind all of the ways that acquiring language extends and enriches the ingredients of cognition that we focus on in this article. The intuitive physics and psychology of infants are likely limited to reasoning about objects and agents in their immediate spatial and temporal vicinity and to their simplest properties and states. But with language, older children become able to reason about a much wider range of physical and psychological situations (Carey 2009). Language also facilitates more powerful learning-to-learn and compositionality (Mikolov et al. 2016), allowing people to learn more quickly and flexibly by representing new concepts and thoughts in relation to existing concepts (Lupyan & Bergen 2016; Lupyan & Clark 2015). Ultimately, the full project of building machines that learn and think like humans must have language at its core.
6. Looking forward

In the last few decades, AI and machine learning have made remarkable progress: Computer programs beat chess masters; AI systems beat Jeopardy champions; apps recognize photos of your friends; machines rival humans on large-scale object recognition; smart phones recognize (and, to a limited extent, understand) speech. The coming years promise still more exciting AI applications, in areas as varied as self-driving cars, medicine, genetics, drug design, and robotics. As a field, AI should be proud of these accomplishments, which have helped move research from academic journals into systems that improve our daily lives.
We should also be mindful of what AI has and has not achieved. Although the pace of progress has been impressive, natural intelligence is still by far the best example of intelligence. Machine performance may rival or exceed human performance on particular tasks, and algorithms may take inspiration from neuroscience or aspects of psychology, but it does not follow that the algorithm learns or thinks like a person. This is a higher bar worth reaching for, potentially leading to more powerful algorithms, while also helping unlock the mysteries of the human mind.
When comparing people with the current best algorithms in AI and machine learning, people learn from fewer data and generalize in richer and more flexible ways. Even for relatively simple concepts such as handwritten characters, people need to see just one or a few examples of a new concept before being able to recognize new examples, generate new examples, and generate new concepts based on related ones (Fig. 1A). So far, these abilities elude even the best deep neural networks for character recognition (Ciresan et al. 2012), which are trained on many examples of each concept and do not flexibly generalize to new tasks. We suggest that the comparative power and flexibility of people's inferences come from the causal and compositional nature of their representations.
We believe that deep learning and other learning paradigms can move closer to human-like learning and thought if they incorporate psychological ingredients, including those outlined in this article. Before closing, we discuss some recent trends that we see as some of the most promising developments in deep learning – trends we hope will continue and lead to more important advances.
6.1. Promising directions in deep learning

There has been recent interest in integrating psychological ingredients with deep neural networks, especially selective attention (Bahdanau et al. 2015; Mnih et al. 2014; Xu et al. 2015), augmented working memory (Graves et al. 2014; 2016; Grefenstette et al. 2015; Sukhbaatar et al. 2015; Weston et al. 2015b), and experience replay (McClelland et al. 1995; Mnih et al. 2015). These ingredients are lower-level than the key cognitive ingredients discussed in this article, yet they suggest a promising trend of using insights from cognitive psychology to improve deep learning, one that may be furthered by incorporating higher-level cognitive ingredients.
Paralleling the human perceptual apparatus, selective attention forces deep learning models to process raw, perceptual data as a series of high-resolution "foveal glimpses" rather than all at once. Somewhat surprisingly, the incorporation of attention has led to substantial performance gains in a variety of domains, including machine translation (Bahdanau et al. 2015), object recognition (Mnih et al. 2014), and image caption generation (Xu et al. 2015). Attention may help these models in several ways. It helps to coordinate complex, often sequential, outputs by attending to only specific aspects of the input, allowing the model to focus on smaller sub-tasks rather than solving an entire problem in one shot. For example, during caption generation, the attentional window has been shown to track the objects as they are mentioned in the caption, where the network may focus on a boy and then a Frisbee when producing a caption like, "A boy throws a Frisbee" (Xu et al. 2015). Attention also allows larger models to be trained without requiring every model parameter to affect every output or action. In generative neural network models, attention has been used to concentrate on generating particular regions of the image rather than the whole image at once (Gregor et al. 2015). This could be a stepping stone toward building more causal generative models in neural networks, such as a neural version of the Bayesian program learning model that could be applied to tackling the Characters Challenge (sect. 3.1).
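A minimal sketch of the soft, additive attention behind several of these systems may help fix ideas; the variable names and dimensions below are our illustrative choices, with the general form following Bahdanau et al. (2015) and Xu et al. (2015):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: 49 image-region feature vectors (a 7x7 grid) and
# a decoder state; all sizes here are illustrative, not from Xu et al.
regions = rng.normal(size=(49, 512))             # annotation vectors
state = rng.normal(size=128)                     # decoder hidden state
W_r = rng.normal(scale=0.05, size=(512, 64))
W_s = rng.normal(scale=0.05, size=(128, 64))
v = rng.normal(scale=0.05, size=64)

# Additive ("soft") attention: score each region against the state,
# normalize to weights, and take the weighted average as the glimpse.
scores = np.tanh(regions @ W_r + state @ W_s) @ v   # shape (49,)
alpha = softmax(scores)                             # attention weights
glimpse = alpha @ regions                           # (512,) context vector
```

The model's next output is conditioned on `glimpse`, a weighted average dominated by the few regions scoring highest against the current state, rather than on all of the regions equally.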
Researchers are also developing neural networks with "working memories" that augment the shorter-term memory provided by unit activation and the longer-term memory provided by the connection weights (Graves et al. 2014; 2016; Grefenstette et al. 2015; Reed & Freitas 2016; Sukhbaatar et al. 2015; Weston et al. 2015b). These developments are also part of a broader trend toward "differentiable programming," the incorporation of classic data structures, such as random access memory, stacks, and queues, into gradient-based learning systems (Dalrymple 2016). For example, the neural Turing machine (NTM) (Graves et al. 2014) and its successor the differentiable neural computer (DNC) (Graves et al. 2016) are neural networks augmented with a random access external memory with read and write operations that maintain end-to-end differentiability. The NTM has been trained to perform sequence-to-sequence prediction tasks such as sequence copying and sorting, and the DNC has been applied to solving block puzzles and finding paths between nodes in a graph after memorizing the graph. Additionally, neural programmer-interpreters learn to represent and execute algorithms such as addition and sorting from fewer examples, by observing input-output pairs (like the NTM and DNC), as well as execution traces (Reed & Freitas 2016). Each model seems to learn genuine programs from examples, albeit in a representation more like assembly language than a high-level programming language.
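A minimal sketch of the content-based read at the heart of these memory-augmented networks (our simplified rendering of the addressing step described for the NTM; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative external memory: 16 slots of 32 dimensions each.
M = rng.normal(size=(16, 32))
key = rng.normal(size=32)   # read key emitted by the controller
beta = 2.0                  # key strength (sharpens the focus)

# Content-based addressing: cosine-similarity match between the key
# and every memory row, softened into a distribution over slots.
sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
w = softmax(beta * sims)    # differentiable read weights
r = w @ M                   # read vector: a soft blend of all slots
```

Because the read is a softmax-weighted blend of every slot rather than a hard lookup, gradients flow through the weights back into the controller, preserving the end-to-end differentiability noted above.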
Although this new generation of neural networks has yet to tackle the types of challenge problems introduced in this article, differentiable programming suggests the intriguing possibility of combining the best of program induction and deep learning. The types of structured representations and model-building ingredients discussed in this article – objects, forces, agents, causality, and compositionality – help explain important facets of human learning and thinking, yet they also bring challenges for performing efficient inference (sect. 4.3.1). Deep learning systems have not yet shown they can work with these representations, but they have demonstrated the surprising effectiveness of gradient descent in large models with high-dimensional parameter spaces. A synthesis of these approaches, able to perform efficient inference over programs that richly model the causal structure an infant sees in the world, would be a major step forward in building human-like AI.
Another example of combining pattern recognition and model-based search comes from recent AI research into the game Go. Go is considerably more difficult for AI than chess, and it was only recently that a computer program – AlphaGo – first beat a world-class player (Chouard 2016) by using a combination of deep convolutional neural networks (ConvNets) and Monte-Carlo Tree Search (Silver et al. 2016). Each of these components has made gains against artificial and real Go players (Gelly & Silver 2008; 2011; Silver et al. 2016; Tian & Zhu 2016), and the notion of combining pattern recognition and model-based search goes back decades in Go and other games. Showing that these approaches can be integrated to beat a human Go champion is an important AI accomplishment (see Fig. 7). Just as important, however, are the new questions and directions they open up for the long-term project of building genuinely human-like AI.
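To make the division of labor concrete, in our paraphrase of the selection rule described by Silver et al. (2016), each simulated move during search maximizes a learned value estimate plus an exploration bonus shaped by the ConvNet's policy prior:

$$a_t = \operatorname*{arg\,max}_{a}\left[\, Q(s,a) + c \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \,\right],$$

where $P(s,a)$ is the policy network's prior probability for move $a$, $Q(s,a)$ is the value estimated from simulations, $N(s,a)$ is a visit count, and $c$ is an exploration constant. Pattern recognition proposes, through $P$ and the value network's initial estimates of $Q$; model-based search disposes, as the simulations refine $Q$ and $N$.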
One worthy goal would be to build an AI system that beats a world-class player with the amount and kind of training human champions receive, rather than overpowering them with Google-scale computational resources. AlphaGo is initially trained on 28.4 million positions and moves from 160,000 unique games played by human experts; it then improves through reinforcement learning, playing 30 million more games against itself. Between the publication of Silver et al. (2016) and facing world champion Lee Sedol, AlphaGo was iteratively retrained several times in this way. The basic system always learned from 30 million games, but it played against successively stronger versions of itself, effectively learning from 100 million or more games altogether (D. Silver, personal communication, 2017). In contrast, Lee has probably played around 50,000 games in his entire life – some 2,000 times fewer. Looking at numbers like these, it is impressive that Lee can even compete with AlphaGo. What would it take to build a professional-level Go AI that learns from only 50,000 games? Perhaps a system that combines the advances of AlphaGo with some of the complementary ingredients for intelligence we argue for here would be a route to that end.
Artificial intelligence could also gain much by trying to match the learning speed and flexibility of normal human Go players. People take a long time to master the game of Go, but as with the Frostbite and Characters challenges (sects. 3.1 and 3.2), humans can quickly learn the basics of the game through a combination of explicit instruction, watching others, and experience. Playing just a few games teaches a human enough to beat someone who has just learned the rules but never played before. Could AlphaGo model these earliest stages of real human learning curves? Human Go players can also adapt what they have learned to innumerable game variants. The Wikipedia page "Go variants" describes versions such as playing on bigger or smaller boards (ranging from 9 × 9 to 38 × 38, not just the usual 19 × 19 board), or playing on boards of different shapes and connectivity structures (rectangles, triangles, hexagons, even a map of the English city Milton Keynes). The board can be a torus, a Möbius strip, a cube, or a diamond lattice in three dimensions. Holes can be cut in the board, in regular or irregular ways. The rules can be adapted to what is known as First Capture Go (the first player to capture a stone wins), NoGo (the player who goes longer without capturing any enemy stones wins), or Time Is Money Go (players begin with a fixed amount of time, and at the end of the game, the number of seconds remaining on each player's clock is added to his or her score). Players may receive bonuses for creating certain stone patterns or capturing territory near certain landmarks. There could be four or more players, competing individually or in teams. In each of these variants, effective play needs to change from the basic game, but a skilled player can adapt and does not simply have to relearn the game from scratch. Could AlphaGo quickly adapt to new variants of Go? Although techniques for handling variable-sized inputs in ConvNets may help in playing on different board sizes (Sermanet et al. 2014), the value functions and policies that AlphaGo learns seem unlikely to generalize as flexibly and automatically as people do. Many of the variants described above would require significant reprogramming and retraining, directed by the smart humans who programmed AlphaGo, not by the system itself. As impressive as AlphaGo is in beating the world's best players at the standard game – and it is extremely impressive – the fact that it cannot even conceive of these variants, let alone adapt to them autonomously, is a sign that it does not understand the game as humans do. Human players can understand these variants and adapt to them because they explicitly represent Go as a game, with a goal to beat an adversary who is playing to achieve the same goal he or she is, governed by rules about how stones can be placed on a board and how board positions are scored. Humans represent their strategies as a response to these constraints, such that if the game changes, they can begin to adjust their strategies accordingly.
In sum, Go presents compelling challenges for AI beyond matching world-class human performance: trying to match human levels of understanding and generalization, based on the same kinds and amounts of data, explicit instructions, and opportunities for social learning afforded to people. In learning to play Go as quickly and as flexibly as they do, people are drawing on most of the cognitive ingredients this article has laid out. They are learning-to-learn with compositional knowledge. They are using their core intuitive psychology and aspects of their intuitive physics (spatial and object representations). And like AlphaGo, they are also integrating model-free pattern recognition with model-based search. We believe that Go AI systems could be built to do all of these things, potentially better capturing how humans learn and understand the game. We believe it would be richly rewarding for AI and cognitive science to pursue this challenge together and that such systems could be a compelling testbed for the principles this article suggests, as well as building on all of the progress to date that AlphaGo represents.
6.2. Future applications to practical AI problems

In this article, we suggested some ingredients for building computational models with more human-like learning and thought. These principles were explained in the context of the Characters and Frostbite Challenges, with special emphasis on reducing the amount of training data required and facilitating transfer to novel yet related tasks. We also see ways these ingredients can spur progress on core AI problems with practical applications. Here we offer some speculative thoughts on these applications.
1. Scene understanding. Deep learning is moving beyond object recognition and toward scene understanding, as evidenced by a flurry of recent work focused on generating natural language captions for images (Karpathy & Fei-Fei 2017; Vinyals et al. 2014; Xu et al. 2015). Yet current algorithms are still better at recognizing objects than understanding scenes, often getting the key objects right but their causal relationships wrong (Fig. 6). We see compositionality, causality, intuitive physics, and intuitive psychology as playing an increasingly important role in reaching true scene understanding. For example, picture a cluttered garage workshop with screwdrivers and hammers hanging from the wall, wood pieces and tools stacked precariously on a work desk, and shelving and boxes framing the scene. For an autonomous agent to effectively navigate and perform tasks in this environment, the agent would need intuitive physics to properly reason about stability and support. A holistic model of the scene would require the composition of individual object models, glued together by relations. Finally, causality helps infuse the recognition of existing tools or the learning of new ones with an understanding of their use, helping to connect different object models in the proper way (e.g., hammering a nail into a wall, or using a sawhorse to support a beam being cut by a saw). If the scene includes people acting or interacting, it will be nearly impossible to understand their actions without thinking about their thoughts and especially their goals and intentions toward the other objects and agents they believe are present.
2. Autonomous agents and intelligent devices. Robots and personal assistants such as cell phones cannot be pre-trained on all possible concepts they may encounter. Like a child learning the meaning of new words, an intelligent and adaptive system should be able to learn new concepts from a small number of examples, as they are encountered naturally in the environment. Common concept types include new spoken words (names like "Ban Ki-Moon" and "Kofi Annan"), new gestures (a secret handshake and a "fist bump"), and new activities, and a human-like system would be able to learn both to recognize and to produce new instances from a small number of examples. As with handwritten characters, a system may be able to quickly learn new concepts by constructing them from pre-existing primitive actions, informed by knowledge of the underlying causal process and learning-to-learn.
3. Autonomous driving. Perfect autonomous driving requires intuitive psychology. Beyond detecting and avoiding pedestrians, autonomous cars could more accurately predict pedestrian behavior by inferring mental states, including their beliefs (e.g., Do they think it is safe to cross the street? Are they paying attention?) and desires (e.g., Where do they want to go? Do they want to cross? Are they retrieving a ball lost in the street?). Similarly, other drivers on the road have similarly complex mental states underlying their behavior (e.g., Does he or she want to change lanes? Pass another car? Is he or she swerving to avoid a hidden hazard? Is he or she distracted?). This type of psychological reasoning, along with other types of model-based causal and physical reasoning, is likely to be especially valuable in challenging and novel driving circumstances for which there are few relevant training data (e.g., navigating unusual construction zones, natural disasters).
4. Creative design. Creativity is often thought to be a pinnacle of human intelligence. Chefs design new dishes, musicians write new songs, architects design new buildings, and entrepreneurs start new businesses. Although we are still far from developing AI systems that can tackle these types of tasks, we see compositionality and causality as central to this goal. Many commonplace acts of creativity are combinatorial, meaning they are unexpected combinations of familiar concepts or ideas (Boden 1998; Ward 1994). As illustrated in Figure 1-iv, novel vehicles can be created as a combination of parts from existing vehicles, and similarly, novel characters can be constructed from the parts of stylistically similar characters, or familiar characters can be re-conceptualized in novel styles (Rehling 2001). In each case, the free combination of parts is not enough on its own: Although compositionality and learning-to-learn can provide the parts for new ideas, causality provides the glue that gives them coherence and purpose.
6.3. Toward more human-like learning and thinking machines

Since the birth of AI in the 1950s, people have wanted to build machines that learn and think like people. We hope researchers in AI, machine learning, and cognitive science will accept our challenge problems as a testbed for progress. Rather than just building systems that recognize handwritten characters and play Frostbite or Go as the end result of an asymptotic process, we suggest that deep learning and other computational paradigms should aim to tackle these tasks using as few training data as people need, and also to evaluate models on a range of human-like generalizations beyond the one task on which the model was trained. We hope that the ingredients outlined in this article will prove useful for working toward this goal: seeing objects and agents rather than features, building causal models and not just recognizing patterns, recombining representations without needing to retrain, and learning-to-learn rather than starting from scratch.
2153learning-to-learn rather than starting from scratch.