Chapter 4: Recognizing Objects
Recognition: Some Early Considerations

You’re obviously able to recognize a huge number of different patterns: different objects (cats, cups, coats), various actions (crawling, climbing, clapping), and different sorts of situations (crises, comedies). You can also recognize many variations of each of these things. You recognize cats standing up and cats sitting down, cats running and cats asleep. And the same is true for recognition of most other patterns in your recognition repertoire.
You also recognize objects even when the available information is incomplete. For example, you can still recognize a cat if only its head and one paw are visible behind a tree. You recognize a chair even when someone is sitting on it, even though the person blocks much of the chair from view.

All of this is true for print as well. You can recognize tens of thousands of words, and you can recognize them whether the words are printed in large type or small, italics or straight letters, UPPER CASE or lower. You can even recognize handwritten words, for which the variation from one to the next is huge.

These variations in the “stimulus input” provide our first indication that object recognition involves some complexity. Another indication comes from the fact that your recognition of various objects, print or otherwise, is influenced by the context in which you encounter those objects. Consider Figure 4.3. The middle character is the same in both words, but the character looks more like an H in the word on the left and more like an A in the word on the right. With this, you easily read the word on the left as “THE” and not “TAE” and the word on the right as “CAT” and not “CHT.”

Of course, object recognition is powerfully influenced by the stimulus itself, that is, by the features that are in view. Processes directly shaped by the stimulus are sometimes called “data driven” but are more commonly said to involve bottom-up processing. The effect of context, however, reminds us that recognition is also influenced by one’s knowledge and expectations. As a result, your reading of Figure 4.3 is guided by your knowledge that “THE” and “CAT” are common words but that “TAE” and “CHT” are not. This sort of influence, relying on your knowledge, is sometimes called “concept-driven,” and processes shaped by knowledge are said to involve top-down processing.
What mechanism underlies both the top-down and bottom-up influences? In the next section, we’ll consider a classic proposal for what the mechanism might be. We’ll then build on this base as we discuss more recent elaborations of this proposal.

The Importance of Features

Common sense suggests that many objects can be recognized by virtue of their parts. You recognize an elephant because you see the trunk, the thick legs, the large body. You know a lollipop is a lollipop because you see the circle shape on top of the straight stick. But how do you recognize the parts themselves? How, for example, do you recognize the trunk on the elephant or the circle in the lollipop? The answer may be simple: Perhaps you recognize the parts by looking at their parts, such as the arcs that make up the circle in the lollipop, or the (roughly) parallel lines that identify the elephant’s trunk.
To put this more generally, recognition might begin with the identification of visual features in the input pattern: the vertical lines, curves, diagonals, and so on. With these features appropriately catalogued, you can start assembling the larger units. If you detect a horizontal together with a vertical, you know you’re looking at a right angle; if you’ve detected four right angles, you know you’re looking at a square.
This broad proposal lines up well with the neuroscience evidence we discussed in Chapter 3. There, we saw that specialized cells in the visual system do seem to act as feature detectors, firing (producing an action potential) whenever the relevant input (i.e., the appropriate feature) is in view. Also, we’ve already noted that people can recognize many variations on the objects they encounter: cats in different positions, A’s in different fonts or different handwritings. An emphasis on features, though, might help with this point. The various A’s, for example, differ from one another in overall shape, but they do have certain things in common: two inwardly sloping lines and a horizontal crossbar. Focusing on features, therefore, might allow us to concentrate on elements shared by the various A’s and so might allow us to recognize A’s despite their apparent diversity.

The importance of features is also evident in data from visual search tasks, that is, tasks in which participants are asked to examine a display and to judge whether a particular target is present in the display or not. This search is remarkably efficient when someone is searching for a target defined by a simple feature (for example, finding a vertical segment in a field of horizontals or a green shape in a field of red shapes). But people are generally slower in searching for a target defined as a combination of features (see Figure 4.4). This is just what we would expect if feature analysis is an early step in your analysis of the visual world, and separate from the step in which you combine the features you’ve detected.

Further support for these claims comes from studies of brain damage. At the start of the chapter, we mentioned apperceptive agnosia, a disorder that involves an inability to assemble the various aspects of an input into an organized whole. A related disorder, integrative agnosia, derives from damage to the parietal lobe.
Patients with this disorder appear relatively normal in tasks requiring them simply to detect features in a display, but they are markedly impaired in tasks that require them to judge how the features are bound together to form complex objects. (See, for example, Behrmann, Peterson, Moscovitch, & Suzuki, 2006; Humphreys & Riddoch, 2014; Robertson, Treisman, Friedman-Hill, & Grabowecky, 1997. For related results, in which transcranial magnetic stimulation was used to disrupt portions of the brain in healthy individuals, see Ashbridge, Walsh, & Cowey, 1997.)

Demonstration 4.1: Features and Feature Combination

The chapter discusses the importance of features in your recognition of objects, and this priority of features is, in fact, easy to demonstrate. In each of the squares below, find the target.

[Squares A through D: search displays, each with the prompt “In the square can you find…”]

Most people find the search in Square A (finding the O) to be extremely easy; the search in Square B (finding the L) is harder. Why is this? In Square A, all you need to do is search for the feature “curve” (or “roundness”). That single feature is enough to identify the target, and searching for a single feature is fast and easy. In contrast, the target in Square B is not defined in terms of a single feature. The target (the L) and the distractor items (the T’s) have the same features (one horizontal and one vertical); the target is distinguished from the distractors only in how the features are assembled. Therefore, you can’t locate the target in Square B simply by hunting for a single feature; instead, you need to take an additional step: You need to think about how the features are put together, and that’s a slower, more effortful process.
The data pattern becomes even clearer in Squares C and D. In Square C, the target (again, defined by a single feature) seems to “pop out” at you, and your search in C is just as fast as it was in A. In other words, when hunting for a single feature, you can hunt through 11 items (Square C) as quickly as you can hunt through 4 items (Square A).
In Square D, however, the target doesn’t “pop out.” Here, you’re searching for a feature combination, so you need to examine the forms one by one. As a result, Square D (with 11 shapes to examine) takes more time than Square B (with just 4).
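The contrast between Squares A/C and Squares B/D can be sketched as a toy timing model. This is only an illustration under stated assumptions: the function names and the millisecond values are hypothetical, not measurements from the studies cited, and the "inspect items one by one" rule is a simplification of serial search.

```python
# Toy model of visual search time. All numbers are illustrative assumptions.
BASE_MS = 400       # hypothetical fixed cost of making a response
PER_ITEM_MS = 50    # hypothetical cost of inspecting a single item


def feature_search_ms(set_size: int) -> float:
    """Target defined by one feature: the target 'pops out',
    so search time does not depend on the number of items."""
    return float(BASE_MS)


def conjunction_search_ms(set_size: int) -> float:
    """Target defined by a feature combination: items are examined one by one.
    If the target is equally likely to be at any position, it is found after
    inspecting (set_size + 1) / 2 items on average."""
    return BASE_MS + PER_ITEM_MS * (set_size + 1) / 2


if __name__ == "__main__":
    for set_size in (4, 11):
        print(set_size,
              feature_search_ms(set_size),
              conjunction_search_ms(set_size))
```

In this sketch, going from 4 items to 11 leaves the single-feature search time unchanged but lengthens the feature-combination search, which is the qualitative pattern the demonstration describes.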
Demonstration adapted from Thornton, T., & Gilden, D. (2007). Parallel and serial processes in visual search. Psychological Review, 114, 71-103.

Word Recognition

Several lines of evidence, therefore, indicate that object recognition does begin with the detection of simple features. Then, once this detection has occurred, separate mechanisms are needed to put the features together, assembling them into complete objects. But how does this assembly proceed, so that we end up seeing not just the features but whole words (or Chihuahuas, or fire hydrants)? In tackling this question, it will be helpful to fill in some more facts that we can then use as a guide to our theory building.

Factors Influencing Recognition

In many studies, participants have been shown stimuli for just a brief duration, perhaps 20 or 30 ms (milliseconds). Older research did this by means of a tachistoscope, a device designed to present stimuli for precisely controlled amounts of time. More modern research uses computers, but the brief displays are still called “tachistoscopic presentations.”
Each stimulus is followed by a post-stimulus mask: often, a random pattern of lines and curves, or a random jumble of letters such as “XJDKEL.” The mask interrupts any continued processing that participants might try to do for the stimulus just presented. In this way, researchers can be certain that a stimulus presented for (say) 20 ms is visible for exactly 20 ms and no longer.

Can people recognize these briefly visible stimuli? The answer depends on many factors, including how familiar a stimulus is. If the stimulus is a word, we can measure familiarity by counting how often that word appears in print, and these counts are an excellent predictor of tachistoscopic recognition. In one early experiment, Jacoby and Dallas (1981) showed participants words that were either very frequent (appearing at least 50 times in every million printed words) or infrequent (occurring only 1 to 5 times per million words of print). Participants viewed these words for 35 ms, followed by a mask. Under these circumstances, they recognized almost twice as many of the frequent words (see Figure 4.5A).

Another factor influencing recognition is recency of view. If participants view a word and then, a little later, view it again, they will recognize the word more readily the second time around. The first exposure primes the participant for the second exposure; more specifically, this is a case of repetition priming.
As an example, participants in one study read a list of words aloud. The participants were then shown a series of words in a tachistoscope. Some of these words were from the earlier list and so had been primed; others were unprimed. For words that were high in frequency, 68% of the unprimed words were recognized, compared to 84% of the primed words. For words low in frequency, 37% of the unprimed words were recognized, compared to 73% of the primed words (see Figure 4.5B; Jacoby & Dallas, 1981).

The Word-Superiority Effect

Figure 4.3 suggests that the recognition of a letter depends on its context, and so an ambiguous letter is read as an A in one setting but an H in another setting. But context also has another effect: Even when a letter is properly printed and quite unambiguous, it’s easier to recognize if it appears within a word than if it appears in isolation. This result might seem paradoxical, because here we have a setting in which it seems easier to do “more work” rather than “less”: You’re more accurate in recognizing all the letters that make up a word (maybe a total of five or six letters) than you are in recognizing just one letter on its own. Paradoxical or not, this pattern is easy to demonstrate, and the advantage for perceiving letters-in-context is called the word-superiority effect (WSE).
The WSE is demonstrated with a “two-alternative, forced-choice” procedure. For example, in some trials we might present a single letter-let’s say K-followed by a post-stimulus mask, and follow that with a question: “Which of these was in the display: an E or a K?” In other trials, we might present a word-let’s say “DARK”-followed by a mask, followed by a question: “Which of these was in the display: an E or a K?”
Note that participants have a 50-50 chance of guessing correctly in either of these situations, and so any contribution from guessing is the same for the letters as it is for the words. Also, for the word stimulus, both of the letters we’ve asked about are plausible endings for the stimulus; either ending would create a common word (“DARE” or “DARK”). Therefore, participants who saw only part of the display (perhaps “DAR”) couldn’t use their knowledge of the language to figure out the display’s final letter. In order to choose between E and K, therefore, participants really need to have seen the relevant letter, and that is exactly what we want.
In this procedure, accuracy rates are reliably higher in the word condition. Apparently, recognizing an entire word is easier than recognizing isolated letters (see Figure 4.6; Johnston & McClelland, 1973; Reicher, 1969; Rumelhart & Siple, 1974; Wheeler, 1970).

Degree of Well-Formedness

As it turns out, though, the term “word-superiority effect” may be misleading, because we don’t need words to produce the pattern evident in Figure 4.6. We get a similar effect if we present participants with letter strings like “FIKE” or “LAFE.” These letter strings are not English words and they’re not familiar, but they look like English strings and (related) are easy to pronounce. And, crucially, strings like these produce a context effect, with the result that letters in these contexts are easier to identify than letters alone.
This effect occurs, though, only if the context is of the right sort. There’s no context benefit if we present a string like “HZYE” or “SBNE.” An E presented within these strings will not show the word-superiority effect; that is, it won’t be recognized more readily than an E presented in isolation.
A parallel set of findings emerges if, instead of asking participants to detect specific letters, we ask them to report all of what they have seen. A letter string like “HZYE” is extremely hard to recognize if presented briefly. With a stimulus like this and, say, a 30-ms exposure, participants may report that they only saw a flash and no letters at all; at best, they may report a letter or two. But with the same 30-ms exposure, participants will generally recognize (and be able to report) strings like “FIKE” or “LAFE,” although they do even better if the stimuli presented are actual, familiar words.
How should we think about these findings? One approach emphasizes the statistically defined regularities in English spelling. Specifically, we can work through a dictionary, counting how often (for example) the letter combination “FI” occurs, or the combination “LA,” or “HZ.” We can do the same for three-letter sequences (“FIK,” “LAF,” and so on). These counts will give us a tally that reveals which letter combinations are more probable in English spelling and which are less probable. We can then use this tally to evaluate new strings, asking, for any string, whether its letter sequences are high-probability ones (occurring often) or low-probability (occurring rarely).

These statistical measures allow us to evaluate how “well formed” a letter string is (that is, how well the letter sequence conforms to the usual spelling patterns of English), and well-formedness is a good predictor of word recognition: The more English-like the string is, the easier it will be to recognize that string, and also the greater the context benefit the string will produce. This well-documented pattern has been known for more than a century (see, e.g., Cattell, 1885) and has been replicated in many studies (Gibson, Bishop, Schiff, & Smith, 1964; Miller, Bruner, & Postman, 1954).

Making Errors

Let’s recap some important points. First, it seems that a letter will be easier to recognize if it appears in a well-formed sequence, but not if it appears in a random sequence. Second, well-formed strings are, overall, easier to perceive than ill-formed strings; this advantage remains even if the well-formed strings are made-up ones that you’ve never seen before (strings like “HAKE” or “COTER”). All of these facts suggest that you somehow are using your knowledge of spelling patterns when you look at, and recognize, the words you encounter, and so you have an easier time with letter strings that conform to these patterns, compared to strings that do not.
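The dictionary-counting procedure described in this section can be sketched in a few lines of code. This is a minimal illustration, not the method used in the studies cited: the ten-word list stands in for a full dictionary, the function names are hypothetical, and scoring a string by the average tally of its letter pairs is just one simple way to summarize the counts.

```python
from collections import Counter

# Tiny stand-in for a dictionary (an assumption for illustration only;
# a real tally would count letter pairs across a full word list).
WORDS = ["FIRE", "FILE", "LIKE", "BIKE", "LAKE",
         "LATE", "SAFE", "LIFE", "HIKE", "FAKE"]


def letter_pairs(string: str) -> list[str]:
    """Return the adjacent two-letter sequences in a string."""
    return [string[i:i + 2] for i in range(len(string) - 1)]


# Tally how often each letter pair occurs across the word list.
PAIR_COUNTS = Counter(pair for word in WORDS for pair in letter_pairs(word))


def well_formedness(string: str) -> float:
    """Average tally of the string's letter pairs (higher = more English-like)."""
    pairs = letter_pairs(string.upper())
    if not pairs:
        return 0.0
    return sum(PAIR_COUNTS[pair] for pair in pairs) / len(pairs)


if __name__ == "__main__":
    for candidate in ("FIKE", "LAFE", "HZYE"):
        print(candidate, round(well_formedness(candidate), 2))
```

Neither “FIKE” nor “LAFE” appears in the word list, yet both score above “HZYE” because their letter pairs occur in other words. The same approach extends to three-letter sequences by changing the window size.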
The influence of spelling patterns is also evident in the mistakes you make. With brief exposures, word recognition is good but not perfect, and the errors that occur are systematic: There’s a strong tendency to misread less-common letter sequences as if they were more-common patterns. So, for example, “TPUM” is likely to be misread as “TRUM” or even “DRUM.” But the reverse errors are rare: “DRUM” is unlikely to be misread as “TRUM” or “TPUM.” These errors can sometimes be quite large, so that someone shown “TPUM” might instead perceive “TRUMPET.” But, large or small, the errors show the pattern described: Misspelled words, partial words, or nonwords are read in a way that brings them into line with normal spelling. In effect, people perceive the input as being more regular than it actually is. Once again, therefore, our recognition seems to be guided by (or, in this case, misguided by) some knowledge of spelling patterns.

Demonstration 4.2: The Broad Influence of the Rules of Spelling

As the chapter describes, many lines of evidence suggest that we are sensitive to the rules of English spelling: We recognize letter strings more easily if the strings conform to these rules. Likewise, strings that obey the spelling rules produce a context effect, but other letter strings do not. In addition, our recognition errors seem to be guided by these rules.
Here’s another way to demonstrate how you’re influenced by the rules of spelling. First, get a pen and a blank piece of paper ready. Next, start tapping your foot at a speed of roughly one tap per second. Then, with your foot still tapping, read the following list at a speed of two taps (roughly 2 seconds) per word. When you’re done, hide the list, grab the pen, and write down as many of the words as you can remember.

Trap Crown Salt Raise Bore Clean Kite Twist Cord Blend

How well did you do? Next, start your foot tapping again, with a speed of two taps per letter string, and do the same task with the following list.

Kero Eevts Wtse Iymts Ookc Rgzea Nmdi Rlmao Kesi Ugdra

How well did you do? It’s virtually certain that you did less well with the second list, but why? Is it just because you were familiar with the letter strings in the first list (all common words) but totally unfamiliar with the strings in the second list? To test this hypothesis, try the task again with the following list, again at two taps per letter string.

Fird Teash Buls Blost Vose Shune Tupe Stend Hune Glone

How well did you do when the strings are totally unfamiliar but conform to the rules of spelling?

Feature Nets and Word Recognition

What lies behind this broad pattern of evidence? What are the processes inside of us that lead to the findings we’ve described? Psychologists’ understanding of these points grows out of a theory published many years ago (Selfridge, 1959). Let’s start with that theory, and then use it as our base as we look at more modern work. (For a glimpse of some of the modern research, including work that links theorizing to neuroscience, see Carreiras, Armstrong, Perea, & Frost, 2014.)

The Design of a Feature Net
Imagine that we want to design a system that will recognize the word “CLOCK” whenever it is in view. How might our “CLOCK” detector work? One option is to “wire” this detector to a C-detector, an L-detector, an O-detector, and so on. Then, whenever these letter detectors are activated, this would activate the word detector. But what activates the letter detectors? Maybe the L-detector is “wired” to a horizontal-line detector and also a vertical-line detector, as shown in Figure 4.7. When these feature detectors are activated, this activates the letter detector.

The idea is that there could be a network of detectors, organized in layers. The “bottom” layer is concerned with features, and that is why networks of this sort are often called feature nets. As we move “upward” in the network, each subsequent layer is concerned with larger-scale objects; using the term we introduced earlier, the flow of information would be bottom-up, from the lower levels toward the upper levels.
But what does it mean to “activate” a detector? At any point in time, each detector in the network has a particular activation level, which reflects the status of the detector at that moment-roughly how energized the detector is. When a detector receives some input, its activation level increases. A strong input will increase the activation level by a lot, and so will a series of weaker inputs. In either case, the activation level will eventually reach the detector’s response threshold, and at that point the detector will fire-that is, send its signal to the other detectors to which it is connected.
These points parallel our description of neurons in Chapter 2, and that’s no accident. If the feature net is to be a serious candidate for how humans recognize patterns, then it has to use the same sorts of building blocks that the brain does. However, let’s be careful not to overstate this point: No one is suggesting that detectors are neurons or even large groups of neurons. Instead, detectors probably involve complex assemblies of neural tissue. Nonetheless, it’s plainly attractive that the hypothesized detectors in the feature net function in a way that’s biologically sensible. Within the net, some detectors will be easier to activate than others-that is, some will require a strong input to make them fire, while others will fire even with a weak input. This difference is created in part by how activated each detector is to begin with. If the detector is moderately activated at the start, then only a little input is needed to raise the activation level to threshold, and so it will be easy to make this detector fire. If a detector is not at all activated at the start, then a strong input is needed to bring the detector to threshold, and so it will be more difficult to make this detector fire.
What determines a detector’s starting activation level? As one factor, detectors that have fired recently will have a higher activation level (think of it as a “warm-up” effect). In addition, detectors that have fired frequently in the past will have a higher activation level (think of it as an “exercise” effect). Overall, then, activation level is dependent on principles of recency and frequency.
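The ideas of activation level, response threshold, and starting activation can be made concrete with a small sketch. Everything here is an assumption made for illustration: the class name, the arbitrary units, the reset-after-firing rule, and the `warmup` parameter (a crude stand-in for the recency effect) are not part of the theory as stated.

```python
class Detector:
    """A toy detector with an activation level and a response threshold."""

    def __init__(self, name, baseline=0.0, threshold=1.0, warmup=0.0):
        self.name = name
        self.baseline = baseline      # starting activation (frequency of past use)
        self.threshold = threshold    # response threshold, in arbitrary units
        self.warmup = warmup          # hypothetical boost to baseline after firing
        self.activation = baseline

    def receive(self, strength):
        """Add input to the activation level; return True if the detector fires."""
        self.activation += strength
        if self.activation >= self.threshold:
            # After firing, the detector settles at a raised resting level,
            # standing in for the "warm-up" (recency) effect.
            self.baseline += self.warmup
            self.activation = self.baseline
            return True
        return False


if __name__ == "__main__":
    frequent = Detector("frequently used", baseline=0.7)
    rare = Detector("rarely used", baseline=0.1)
    # The same weak input fires the well-practiced detector but not the other.
    print(frequent.receive(0.4), rare.receive(0.4))
    # A series of weak inputs eventually brings the rare detector to threshold.
    print(rare.receive(0.4), rare.receive(0.4))
```

In this sketch a high starting activation means a weak input is enough to reach threshold, while a detector starting near zero needs either a strong input or several weak ones.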
We now can put these mechanisms to work. Why are frequent words in the language easier to recognize than rare words? Frequent words, by definition, appear often in the things you read. Therefore, the detectors needed for recognizing these words have been frequently used, so they have relatively high levels of activation. Thus, even a weak signal (e.g., a brief or dim presentation of the word) will bring these detectors to their response threshold and will be enough to make them fire. As a result, the word will be recognized even with a degraded input.
Repetition priming is explained in similar terms. Presenting a word once will cause the relevant detectors to fire. Once they’ve fired, activation levels will be temporarily lifted (because of recency of use). Therefore, only a weak signal will be needed to make the detectors fire again. As a result, the word will be more easily recognized the second time around.

The Feature Net and Well-Formedness

The net we’ve described so far cannot, however, explain all of the data. Consider the effects of well-formedness: for instance, the fact that people are able to read letter strings like “PIRT” or “HICE” even when those strings are presented very briefly (or dimly or in low contrast), but not ill-formed strings. How can we explain this finding? One option is to add another layer to the net, a layer filled with detectors for letter combinations. Thus, in Figure 4.8, we’ve added a layer of bigram detectors, that is, detectors of letter pairs. These detectors, like all the rest, will be triggered by lower-level detectors and send their output to higher-level detectors. And just like any other detector, each bigram detector will start out with a certain activation level, influenced by the frequency with which the detector has fired in the past and by the recency with which it has fired.

This turns out to be all the theory we need. You have never seen the sequence “HICE” before, but you have seen the letter pair HI (in “HIT,” “HIGH,” or “HILL”) and the pair CE (“FACE,” “MICE,” “JUICE”). The detectors for these letter pairs, therefore, have high activation levels at the start, so they don’t need much additional input to reach their threshold. As a result, these detectors will fire with only weak input. That will make the corresponding letter combinations easy to recognize, facilitating the recognition of strings like “HICE.” None of this is true for “IJPV” or “RSFK.” Because none of these letter combinations are familiar, these strings will receive no benefits from priming.
As a result, a strong input will be needed to bring the relevant detectors to threshold, and so these strings will be recognized only with difficulty. (For more on bigram detectors and how they work, see Grainger, Rey, & Dufau, 2008; Grainger & Whitney, 2004; Whitney, 2001. For some complications, see Rayner & Pollatsek, 2011.)

Recovery from Confusion

Imagine that we present the word “CORN” for just 20 ms. In this setting, the visual system has only a limited opportunity to analyze the input, so it’s possible that you’ll miss some of the input’s features. For example, let’s imagine that the second letter in this word (the O) is hard to see, so that only the bottom curve is detected.

This partial information invites confusion. If all you know is “the second letter had a bottom curve,” then perhaps this letter was an O, or perhaps it was a U, or a Q, or maybe an S. Figure 4.9 shows how this would play out in terms of the network. We’ve already said that you detected the bottom curve, and that means the “bottom-curve detector” is activated. This detector, in turn, provides input to the O-detector and also to the detectors for U, Q, and S, and so activation in this feature detector causes activation in all of these letter detectors.

Of course, each of these letter detectors is wired so that it can also receive input from other feature detectors. (And so usually the O-detector also gets input from detectors for left curves, right curves, and top curves.) We’ve already said, though, that with this brief input these other features weren’t detected this time around. As a result, the O-detector will only be weakly activated (because it’s not getting its usual full input), and the same is true for the detectors for U, Q, and S.
In this situation, therefore, the network has partial information at the feature level (because only one of the O’s features was detected), and this leads to confusion at the letter level: Too many letter detectors are firing (because the now-activated bottom-curve detector is wired to all of them). And, roughly speaking, all of these letter detectors are firing in a fashion that signals uncertainty, because they’re each receiving input from only one of their usual feature detectors.
The confusion continues in the information sent upward from the letter level to the bigram level. The detector for the CO bigram will receive a strong signal from the C-detector (because the C was clearly visible) but only a weak signal from the O-detector (because the O wasn’t clearly visible). The CU-detector will get roughly the same input-a strong signal from the C-detector and a weak signal from the U-detector. Likewise for the CQ- and CS-detectors. In other words, we can imagine that the signal being sent from the letter detectors is “maybe CO or maybe CU or maybe CQ or maybe CS.” The confusion is, however, sorted out at the bigram level. All four bigram detectors in this situation are receiving the same input-a strong signal from one of their letters and a weak signal from the other. But the four detectors don’t all respond in the same way. The CO-detector is well primed (because this is a frequent pattern), so the activation it’s receiving will probably be enough to fire this (primed) detector. The CU-detector is less primed (because this is a less frequent pattern); the CQ- and CS-detectors, if they even exist, are not primed at all. The input to these latter detectors is therefore unlikely to activate them-because, again, they’re less well primed and so won’t respond to this weak input.
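The way the bigram level sorts out this confusion can be sketched with a few lines of arithmetic. The priming values, signal strengths, and threshold below are hypothetical numbers chosen only to respect the ordering described in the text (CO well primed, CU less so, CQ and CS not at all); the function name is likewise an assumption.

```python
# Hypothetical starting activation for each bigram detector (arbitrary units).
PRIMING = {"CO": 0.6, "CU": 0.3, "CQ": 0.0, "CS": 0.0}

THRESHOLD = 1.0   # hypothetical response threshold
STRONG = 0.3      # hypothetical signal from a clearly seen letter
WEAK = 0.15       # hypothetical signal from a weakly activated letter detector


def firing_bigrams(clear_letter, weak_candidates):
    """Return the bigram detectors that reach threshold when the first letter
    is clearly seen and the second letter is one of several weak candidates."""
    fired = []
    for letter in weak_candidates:
        bigram = clear_letter + letter
        # Every candidate bigram receives the same input: one strong signal
        # and one weak signal. Only the starting activation differs.
        total = PRIMING.get(bigram, 0.0) + STRONG + WEAK
        if total >= THRESHOLD:
            fired.append(bigram)
    return fired


if __name__ == "__main__":
    print(firing_bigrams("C", ["O", "U", "Q", "S"]))  # prints ['CO']
```

All four detectors receive identical input in this sketch, so the outcome is decided entirely by how well primed each detector was beforehand.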
What will be the result of all this? The network was “under-stimulated” at the feature level (with only a subset of the input’s features detected) and therefore confused at the letter level (with too many detectors firing). But then, at the bigram level, it’s only the CO-detector that fires, because at this level it is the detector (because of priming) most likely to respond to the weak input. Thus, in a totally automatic fashion, the network recovers from its own confusion and, in this case, avoids an error.

Ambiguous Inputs

Look again at Figure 4.3. The second character is exactly the same as the fifth, but the left-hand string is perceived as “THE” (and the character is identified as an H) and the right-hand string is perceived as “CAT” (and the character as an A).
What’s going on here? In the string on the left, the initial T is clearly in view, and so presumably the T-detector will fire strongly in response. The next character in the display will probably trigger some of the features normally associated with an A and some normally associated with an H. This will cause the A-detector to fire, but only weakly (because only some of the A’s features are present), and likewise for the H-detector. At the letter level, then, there will be uncertainty about what this character is.
What happens next, though, follows a by-now familiar logic: With only weak activation of the A- and H-detectors, only a moderate signal will be sent upward to the TH- and TA-detectors. Likewise, it seems plausible that only a moderate signal will be sent to the THE- and TAE-detectors at the word level. But, of course, the THE-detector is enormously well primed; if there is a TAE-detector, it would be barely primed, since this is a string that’s rarely encountered. Thus, the THE- and TAE-detectors might be receiving similar input, but this input is sufficient only for the (well-primed) THE-detector, so only it will respond. In this way, the net will recognize the ambiguous pattern as “THE,” not “TAE.” (The same logic applies, of course, to the ambiguous pattern on the right, perceived as “CAT,” not “CHT.”)

A similar explanation will handle the word-superiority effect (see, e.g., Rumelhart & Siple, 1974). To take a simple case, imagine that we present the letter A in the context “AT.” If the presentation is brief enough, participants may see very little of the A, perhaps just the horizontal crossbar. This wouldn’t be enough to distinguish among A, F, or H, and so all these letter detectors would fire weakly. If this were all the information the participants had, they’d be stuck. But let’s imagine that the participants did perceive the second letter in the display, the T. It seems likely that the AT bigram is much better primed than the FT or HT bigrams. (That’s because you often encounter words like “CAT” or “BOAT”; words like “SOFT” or “HEFT” are used less frequently.) Therefore, the weak firing of the A-detector would be enough to fire the AT bigram detector, while the weak firing for the F and H might not trigger their bigram detectors. In this way, a “choice” would be made at the bigram level that the input was “AT” and not something else. Once this bigram has been detected, answering the question “Was there an A or an F in the display?” is easy.
In this way, the letter will be better detected in context than in isolation. This isn’t because context enables you to see more: instead, context allows you to make better use of what you see.
Recognition Errors
There is, however, a downside to all this. Imagine that we present the string “CQRN” to participants. If the presentation is brief enough, the participants will register only a subset of the string’s features. Let’s imagine that they register only the bottom bit of the string’s second letter. This detection of the bottom curve will weakly activate the Q-detector and also the U-detector and the O-detector. The resulting pattern of network activation is shown in Figure 4.10.
Of course, the pattern of activation here is exactly the same as it was in Figure 4.9. In both cases, perceivers have seen the features for the C, R, and N and have only seen the second letter’s bottom curve. And we’ve already walked through the network’s response to this feature pattern: This configuration will lead to confusion at the letter level, but the confusion will get sorted out at the bigram level, with the (primed) CO-detector responding to this input and other (less well primed) detectors not responding. As a result, the stimulus will be (mis)identified as “CORN.” In the situation described in Figure 4.9, the stimulus actually was “CORN,” and so the dynamic built into the net aids performance, allowing the network to recover from its initial confusion. In the case we’re considering now (with “CQRN” as the stimulus), the exact same dynamic causes the network to misread the stimulus.
This example helps us understand how recognition errors occur and why those errors tend to make the input look more regular than it really is. The basic idea is that the network is biased, favoring frequent letter combinations over infrequent ones. In effect, the network operates on the basis of “when in doubt, assume that the input falls into the frequent pattern.” The reason, of course, is simply that the detectors for the frequent pattern are well primed-and therefore easier to trigger.
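The bias just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the feature net itself: the `Detector` class, the threshold, and every priming value are invented here to make the logic concrete. The sketch shows that an identical weak signal reaching several bigram detectors triggers only the well-primed one, so the output is the same whether the stimulus was “CORN” or “CQRN.”

```python
from dataclasses import dataclass


@dataclass
class Detector:
    """Hypothetical detector: fires when priming plus input reaches threshold."""
    name: str
    priming: float          # resting activation; higher for frequently seen patterns
    threshold: float = 1.0  # activation required before the detector responds

    def fires(self, input_signal: float) -> bool:
        return self.priming + input_signal >= self.threshold


# Invented priming levels: CO is treated as a frequent bigram, CU and CQ as rarer.
bigram_detectors = [
    Detector("CO", priming=0.7),
    Detector("CU", priming=0.4),
    Detector("CQ", priming=0.1),
]


def recognize_bigram(weak_signal: float) -> list:
    """Return the names of the bigram detectors that fire for this input."""
    return [d.name for d in bigram_detectors if d.fires(weak_signal)]


# Only the bottom curve of the second letter was registered, so the O-, U-,
# and Q-detectors all pass the same weak signal upward.
print(recognize_bigram(0.4))  # prints ['CO']
```

Nothing in the sketch checks what the stimulus really was, which is the point: the same mechanism that rescues a degraded “CORN” also misreads “CQRN.”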
Let’s emphasize, though, that the bias built into the network facilitates perception if the input is, in fact, a frequent word, and these (by definition) are the words you encounter most of the time. The bias will pull the network toward errors if the input happens to have an unusual spelling pattern, but (by definition) these inputs are less common in your experience. Hence, the network’s bias helps perception more often than it hurts.
Distributed Knowledge
We’ve now seen many indications that the network’s functioning is guided by knowledge of spelling patterns. This is evident in the fact that letter strings are easier to recognize if they conform to normal spelling. The same point is evident in the fact that letter strings provide a context benefit (the WSE) only if they conform to normal spelling. Even more evidence comes from the fact that when errors occur, they “shift” the perception toward patterns of normal spelling.
To explain these results, we’ve suggested that the network “knows” (for example) that CO is a common bigram in English, while CF is not, and also “knows” that THE is a common sequence but TAE is not. The network seems to rely on this “knowledge” in “choosing” its “interpretation” of unclear or ambiguous inputs. Similarly, the network seems to “expect” certain patterns and not others, and is more efficient when the input lines up with those “expectations.”
Obviously, we’ve wrapped quotation marks around several of these words to emphasize that the sense in which the net “knows” facts about spelling, or the sense in which it “expects” things or makes “interpretations,” is a little peculiar. In reality, knowledge about spelling patterns isn’t explicitly stored anywhere in the network. Nowhere within the net is there a sentence like “CO is a common bigram in English; CF is not.” Instead, this memory (if we even want to call it that) is manifest only in the fact that the CO-detector happens to be more primed than the CF-detector. The CO-detector doesn’t “know” anything about this advantage, nor does the CF-detector know anything about its disadvantage. Each one simply does its job, and in the course of doing their jobs, sometimes a “competition” will take place between these detectors. (This sort of competition was illustrated in Figures 4.9 and 4.10.) When these competitions occur, they’ll be “decided” by activation levels: The better-primed detector will be more likely to respond and therefore will be more likely to influence subsequent events. That’s the entire mechanism through which these “knowledge effects” arise. That’s how “expectations” or “inferences” emerge-as a direct consequence of the activation levels.
To put this into technical terms, the network’s “knowledge” is not locally represented anywhere; it isn’t stored in a particular location or built into a specific process. As a result, we cannot look just at the level of priming in the CO-detector and conclude that this detector represents a frequently seen bigram. Nor can we look at the CF-detector and conclude that it represents a rarely seen bigram. Instead, we need to look at the relationship between these priming levels, and we also need to look at how this relationship will lead to one detector being more influential than the other. In this way, the knowledge about bigram frequencies is contained within the network via a distributed representation: it’s knowledge, in other words, that’s represented by a pattern of activations that’s distributed across the network and detectable only if we consider how the entire network functions.
What may be most remarkable about the feature net, then, lies in how much can be accomplished with a distributed representation, and thus with simple, mechanical elements correctly connected to one another. The net appears to make inferences and to know the rules of English spelling. But the actual mechanics of the net involve neither inferences nor knowledge (at least, not in any conventional sense). You and I can see how the inferences unfold by taking a bird’s-eye view and considering how all the detectors work together as a system. But nothing in the net’s functioning depends on the bird’s-eye view. Instead, the activity of each detector is locally determined, influenced by just those detectors feeding into it. When all these detectors work together, though, the result is a process that acts as if it knows the rules. But the rules themselves play no role in guiding the network’s moment-by-moment activities.
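One way to make the idea of a distributed representation concrete is a small Python sketch. It rests on assumptions that are not part of the chapter: a tiny invented word list stands in for a reader’s experience, and priming is modeled as a simple exposure count. The sketch never stores a rule such as “CO is common”; each count is just a number attached to one detector, and the apparent “knowledge” shows up only when detectors compete.

```python
from collections import Counter

# Hypothetical "reading experience": a tiny word list standing in for exposure.
experience = ["CORN", "COAT", "COLD", "COME", "CODE", "CUTE", "CUBE"]

# Each exposure nudges up the priming of the bigram detectors it activates.
# No detector records whether its own count is high or low relative to others.
priming = Counter()
for word in experience:
    for i in range(len(word) - 1):
        priming[word[i:i + 2]] += 1


def winner(candidates, priming):
    """Settle a 'competition' among equally stimulated detectors by priming alone."""
    return max(candidates, key=lambda bigram: priming[bigram])


# Ambiguous second letter: the CO-, CU-, and CQ-detectors get the same weak input.
print(winner(["CO", "CU", "CQ"], priming))  # prints CO
```

Looking at `priming["CO"]` in isolation says nothing about spelling regularities; only the comparison across detectors, carried out by `winner`, produces behavior that looks rule-governed.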
Efficiency versus Accuracy
One other point about the network needs emphasis: The network does make mistakes, misreading some inputs and misinterpreting some patterns. As we’ve seen, though, these errors are produced by exactly the same mechanisms that are responsible for the network’s main advantages-its ability to deal with ambiguous inputs, for example, or to recover from confusion. Perhaps, therefore, we should view the errors as the price you pay in order to gain the benefits associated with the net: If you want a mechanism that’s able to deal with unclear or partial inputs, you have to live with the fact that sometimes the mechanism will make errors.
But do you really need to pay this price? After all, outside of the lab you’re unlikely to encounter fast-paced tachistoscopic inputs. Instead, you see stimuli that are out in view for long periods of time, stimuli that you can inspect at your leisure. Why, therefore, don’t you take the moment to scrutinize these inputs so that you can rely on fewer inferences and assumptions, and in that way gain a higher level of accuracy in recognizing the objects you encounter?
The answer is straightforward. To maximize accuracy, you could, in principle, scrutinize every character on the page. That way, if a character were missing or misprinted, you would be sure to detect it. But the cost associated with this strategy would be intolerable. Reading would be unspeakably slow (partly because the speed with which you move your eyes is relatively slow-no more than four or five eye movements per second). In contrast, it’s possible to make inferences about a page with remarkable speed, and this leads readers to adopt the obvious strategy: They read some of the letters and make inferences about the rest. And for the most part, those inferences are safe-thanks to the simple fact that our language (like most aspects of our world) contains some redundncies, so that one doesn’t need every lettr to identify what a wrd is; oftn the missng letter is perfctly predctable from the contxt, virtually guaranteeing that inferences will be correct.
Demonstration 4.3: Inferences in Reading
The role of inferences within the normal process of reading is easy to demonstrate. In fact, you make these inferences even when you don’t want to-that is, even when you’re trying to read carefully. To see this, count how many times the letter F appears in the following passage.
FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS.
The correct answer is six; did you find them all? Many people miss one, two, or even three of the Fs. Why is this?
In normal reading, you don’t look at every letter, identifying it before you move on to the next. If you did, reading would be impossibly slow. (We can only move our eyes four or five times each second. If we looked at every letter, we’d only be able to read about 5 characters per second, or 300 characters per minute. For most material, we read at a rate roughly 500% faster than this.)
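The arithmetic in the parenthesis can be checked with a short Python calculation. Two assumptions are made here that the text does not state: “500% faster” is read as six times the letter-by-letter rate, and each eye movement is treated as taking in exactly one character.

```python
# Figures from the demonstration; the 6x reading of "500% faster" is an assumption.
eye_movements_per_second = 5
letter_by_letter_cpm = eye_movements_per_second * 60      # characters per minute
typical_cpm = letter_by_letter_cpm * (1 + 500 / 100)       # "500% faster"

# Share of characters that could be fixated one at a time at the typical rate.
fraction_fixated = letter_by_letter_cpm / typical_cpm

print(letter_by_letter_cpm, typical_cpm, round(fraction_fixated, 2))
```

Under these assumptions, at most about one character in six could be inspected individually at a typical reading rate, so the remainder has to be supplied by inference.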
How, then, do you read? You actually skip many of the characters on the page, letting your eyes hop along each line of print, relying on inference to fill in the information that your eyes are skipping over. As the text chapter describes, this process is made possible by the “inferential” character of your recognition network, and it is enormously efficient. However, the process also risks error, because your inferences sometimes are wrong, and because you can sometimes miss something (a specific word or letter) that you’re hunting for-as in this demonstration.
The process of skipping and making inferences is especially likely to occur when the words in the text are predictable. If (for example) a sentence uses the phrase “Birds of a feather flock together,” you surely know what the last word in this sequence will be without looking at it carefully. As a result, this is a word you’re likely to skip over as you read along. By the same logic, the word “of” is often quite predictable in many sentences, and so you probably missed one of the three Fs appearing in the word “of.” (The f in “of” is also hard to spot for another reason: Many people search for the f by sounding out the sentence and listening for the [f] sound. Of course, “of” is pronounced as though it ended with a v, not an f, and so it doesn’t contain the sound people are hunting for.)
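A letter-by-letter count is exactly what a program does and what a skilled reader does not. The short Python check below, offered only as an illustration, counts the Fs in the demonstration passage and shows how many of them sit inside the word “OF.” The variable names are arbitrary.

```python
passage = ("FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY "
           "COMBINED WITH THE EXPERIENCE OF YEARS.")

# Exhaustive scan: every character is inspected, none is inferred.
total_fs = passage.count("F")

# Split into words (dropping the final period) and count the Fs in "OF" alone.
words = passage.rstrip(".").split()
fs_in_of = sum(word.count("F") for word in words if word == "OF")

print(total_fs, fs_in_of)  # prints 6 3
```

Half of the Fs fall in the highly predictable word “OF,” which fits the observation that readers commonly miss up to three of them.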
Note in addition that this process of skipping and inferring is so well practiced and so routine that you cannot “turn off” the process, even when you want to. This is why proofreading is usually difficult, as you skip by (and therefore overlook) your own errors. It’s also why this demonstration works-because you have a hard time forcing yourself into a letter-by-letter examination of the text even when you want to.
Descendants of the Feature Net
We mentioned early on that we were discussing the “classic” version of the feature net. This version brings a number of themes into view-including the trade-off between efficiency and accuracy and the idea of distributed knowledge built into a network’s functioning.
Over the years, though, researchers have offered improvements, and in the next sections we’ll consider three of their proposals. All three preserve the idea of a network of interconnected detectors, but all three extend this idea in important ways. We’ll look first at a proposal that highlights the role of inhibitory connections among detectors. Then we’ll turn to a proposal that applies the network idea to the recognition of complex three-dimensional objects. Finally, we’ll consider a proposal that rests on the idea that your ability to recognize objects may depend on your viewing perspective when you encounter those objects.
The McClelland and Rumelhart Model
In the network proposal we’ve considered so far, activation of one detector serves to activate other detectors. Other models involve a mechanism through which detectors can inhibit one another, so that the activation of one detector can decrease the activation in other detectors.
One highly influential model of this sort was proposed by McClelland and Rumelhart (1981); a portion of their model is illustrated in Figure 4.11. This network, like the one we’ve been discussing, is better able to identify well-formed strings than irregular strings; this net is also more efficient in identifying characters in context as opposed to characters in isolation. However, several attributes of this net make it possible to accomplish all this without bigram detectors.
In Figure 4.11, excitatory connections-connections that allow one detector to activate its neighbors-are shown as red arrows; for example, detection of a T serves to “excite” the “TRIP” detector. Other connections are inhibitory, and so (for example) detection of a G deactivates, or inhibits, the “TRIP” detector. These inhibitory connections are shown in the figure with dots. In addition, this model allows for more complicated signaling than we’ve used so far. In our discussion, we have assumed that lower-level detectors trigger upper-level detectors, but not the reverse. The flow of information, it seemed, was a one-way street. In the McClelland and Rumelhart model, though, higher-level detectors (word detectors) can influence lower-level detectors, and detectors at any level can also influence other detectors at the same level (e.g., letter detectors can inhibit other letter detectors; word detectors can inhibit other word detectors).
To see how this would work, let’s say that the word “TRIP” is briefly shown, allowing a viewer to see enough features to identify only the R, I, and P. Detectors for these letters will therefore fire, in turn activating the detector for “TRIP.” Activation of this word detector will inhibit the firing of other word detectors (e.g., detectors for “TRAP” and “TAKE”), so that these other words are less likely to arise as distractions or competitors with the target word.
At the same time, activation of the “TRIP” detector will also excite the detectors for its component letters-that is, detectors for T, R, I, and P. The R-, I-, and P-detectors, we’ve assumed, were already firing, so this extra activation “from above” has little impact. But the T-detector wasn’t firing before. The relevant features were on the scene but in a degraded form (thanks to the brief presentation), and this weak input was insufficient to trigger an unprimed detector. But once the excitation from the “TRIP” detector primes the T-detector, it’s more likely to fire, even with a weak input.
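The “TRIP” walk-through can be traced in a toy Python sketch. This is not the published model, which uses graded activations updated over time; it is a one-pass simplification under assumed values. The weights `EXCITE`, `INHIBIT`, and `FEEDBACK`, the `THRESHOLD`, and the input strengths are all invented for illustration.

```python
WORDS = ["TRIP", "TRAP", "TAKE"]
EXCITE = 1.0      # hypothetical letter-to-word excitation
INHIBIT = 1.0     # hypothetical inhibition (letter-to-word and word-to-word)
FEEDBACK = 0.5    # hypothetical word-to-letter excitation
THRESHOLD = 1.0   # activation a letter detector needs in order to fire

# Bottom-up evidence at each letter position after a brief look at "TRIP":
# R, I, and P are clearly seen; the T is degraded and only weakly supported.
letter_input = [{"T": 0.6}, {"R": 1.0}, {"I": 1.0}, {"P": 1.0}]


def word_activation(word):
    """Firing letters excite words that contain them and inhibit words that don't."""
    total = 0.0
    for position, seen in enumerate(letter_input):
        for letter, strength in seen.items():
            if strength < THRESHOLD:
                continue  # a letter detector that is not firing sends no signal
            total += EXCITE if word[position] == letter else -INHIBIT
    return total


activations = {word: word_activation(word) for word in WORDS}
best = max(activations, key=activations.get)

# Word-to-word inhibition: the strongest word detector suppresses its rivals.
rivals = {w: a - INHIBIT * activations[best]
          for w, a in activations.items() if w != best}

# Top-down feedback: the winning word primes the detector for its first letter.
t_before = letter_input[0]["T"]
t_after = t_before + FEEDBACK

print(best, rivals, t_before >= THRESHOLD, t_after >= THRESHOLD)
```

With these made-up numbers the “TRIP” detector wins, “TRAP” and “TAKE” are pushed down, and the T-detector crosses threshold only after it receives excitation from above, which mirrors the sequence described in the text.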
In effect, then, activation of the word detector for “TRIP” implies that this is a context in which a T is quite likely. The network therefore responds to this suggestion by “preparing itself” for a T. Once the network is suitably prepared (by the appropriate priming), detection of this letter is facilitated. In this way, the detection of a letter sequence (the word “TRIP”) makes the network more sensitive to elements that are likely to occur within that sequence. That is exactly what we need in order for the network to be responsive to the regularities of spelling patterns.
Let’s also note that the two-way communication that’s in play here fits well with how the nervous system operates: Neurons in the eyeballs send activation to the brain but also receive activation from the brain; neurons in the lateral geniculate nucleus (LGN) send activation to the visual cortex but also receive activation from the cortex. Facts like these make it clear that visual processing is not a one-way process, with information flowing simply from the eyes toward the brain. Instead, signaling occurs in both an ascending (toward the brain) and a descending (away from the brain) direction, just as the McClelland and Rumelhart model claims.
Recognition by Components
The McClelland and Rumelhart model-like the feature net we started with-was designed initially as an account of how people recognize printed language. But, of course, we recognize many objects other than print, including the three-dimensional objects that fill our world-chairs and lamps and cars and trees. Can these objects also be recognized by a feature network? The answer turns out to be yes.
Consider a network theory known as the recognition by components (RBC) model (Hummel & Biederman, 1992; Hummel, 2013). This model includes several important innovations, one of which is the inclusion of an intermediate level of detectors, sensitive to geons (short for “geometric ions”). The idea is that geons might serve as the basic building blocks of all the objects we recognize-geons are, in essence, the alphabet from which all objects are constructed. Geons are simple shapes, such as cylinders, cones, and blocks (see Figure 4.12A), and according to Biederman (1987, 1990), we only need 30 or so different geons to describe every object in the world, just as 26 letters are all we need to spell all the words of English. These geons can be combined in various ways-in a top-of relation, or a side-connected relation, and so on-to create all the objects we perceive (see Figure 4.12B).
The RBC model, like the other networks we’ve been discussing, uses a hierarchy of detectors. The lowest-level detectors are feature detectors, which respond to edges, curves, angles, and so on. These detectors in turn activate the geon detectors. Higher levels of detectors are then sensitive to combinations of geons. More precisely, geons are assembled into complex arrangements called “geon assemblies,” which explicitly represent the relations between geons (e.g., top-of or side-connected). These assemblies, finally, activate the object model, a representation of the complete, recognized object.
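The role of geon assemblies can be illustrated with a small data-structure sketch in Python. It is a loose illustration, not the RBC model: the object names, the geon labels, and the `recognize` function are hypothetical, and real geon detection from edges and curves is omitted entirely. The sketch shows only why the relations must be represented explicitly: the same two geons yield different objects when the relation between them changes.

```python
# Hypothetical object models: each is a set of (geon, relation, geon) triples.
OBJECT_MODELS = {
    "cup": {("curved_cylinder", "side-connected", "cylinder")},
    "pail": {("curved_cylinder", "top-of", "cylinder")},
}


def recognize(assembly):
    """Return the objects whose stored geon assembly appears in the observed one."""
    return [name for name, model in OBJECT_MODELS.items() if model <= assembly]


# Same geons in both cases; only the relation between them differs.
handle_on_side = {("curved_cylinder", "side-connected", "cylinder")}
handle_on_top = {("curved_cylinder", "top-of", "cylinder")}

print(recognize(handle_on_side), recognize(handle_on_top))
```

Because each triple names geons rather than image features, nothing in the stored models depends on the angle from which the object happens to be seen.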
The presence of the geon and geon-assembly levels within this hierarchy offers several advantages. For one, geons can be identified from virtually any angle of view. As a result, recognition based on geons is viewpoint-independent. Thus, no matter what your position is relative to a cat, you’ll be able to identify its geons and identify the cat. Moreover, it seems that most objects can be recognized from just a few geons. As a consequence, geon-based models like RBC can recognize an object even if many of the object’s geons are hidden from view.
Recognition via Multiple Views
A number of researchers have offered a different approach to object recognition (Hayward & Williams, 2000; Tarr, 1995; Tarr & Bülthoff, 1998; Vuong & Tarr, 2004; Wallis & Bülthoff, 1999). They propose that people have stored in memory a number of different views of each object they recognize: an image of what a cat looks like when viewed head-on, an image of what it looks like from the left, and so on. According to this perspective, you’ll recognize Felix as a cat only if you can match your current view of Felix with one of these remembered views. But the number of views in memory is limited-maybe a half dozen or so-and so, in many cases, your current view won’t line up with any of the available images. In that situation, you’ll need to “rotate” the current view to bring it into alignment with one of the views in memory, and this mental rotation will cause a slight delay in the recognition.
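The multiple-views idea lends itself to a simple Python sketch. Every number below is an assumption made for illustration: the six stored viewing angles, the base recognition time, and the linear cost per degree of mental rotation. The text says only that the number of stored views is small and that rotation causes a slight delay.

```python
# Invented parameters; the text gives no numerical values.
STORED_VIEWS_DEG = [0, 60, 120, 180, 240, 300]  # "a half dozen or so" stored views
BASE_TIME_MS = 500   # hypothetical time when the current view matches memory
MS_PER_DEGREE = 2    # hypothetical cost of mental rotation


def angular_distance(a, b):
    """Smallest rotation, in degrees, that brings angle a into line with angle b."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def recognition_time(current_view_deg):
    """Base time plus a delay proportional to the rotation to the nearest stored view."""
    rotation = min(angular_distance(current_view_deg, view)
                   for view in STORED_VIEWS_DEG)
    return BASE_TIME_MS + MS_PER_DEGREE * rotation


print(recognition_time(60), recognition_time(90))  # prints 500 560
```

A view that coincides with a stored one is recognized at the base time, while a view halfway between two stored ones incurs the largest delay, which is the viewpoint-dependent pattern this proposal predicts.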
The key, then, is that recognition sometimes requires mental rotation, and as a result it will be slower from some viewpoints than from others. In other words, the speed of recognition will be viewpoint-dependent, and a growing body of data confirms this claim. We’ve already noted that you can recognize objects from many different angles, and your recognition is generally fast. However, data indicate that recognition is faster from some angles than others, in a way that’s consistent with this multiple-views proposal.
According to this perspective, how exactly does viewpoint-dependent recognition proceed? One proposal resembles the network models we’ve been discussing (Riesenhuber & Poggio, 1999, 2002; Tarr, 1999). In this proposal, there is a hierarchy of detectors, with each successive layer within the network concerned with more complex aspects of the whole. Thus, low-level detectors respond to lines at certain orientations; higher-level detectors respond to corners and notches. At the top of the hierarchy are detectors that respond to the sight of whole objects. It is important, though, that these detectors each represent what the object looks like from a particular vantage point, and so the detectors fire when there is a match to one of these view-tuned representations.
These representations are probably supported by tissue in the inferotemporal cortex, near the terminus of the what pathway (see Figure 3.10). Recording from cells in this area has shown that many neurons here seem object-specific-that is, they fire preferentially when a certain type of object is on the scene. (For an example of just how specific these cells can be in their “preferred” target, see Figure 4.13.) Crucially, though, many of these neurons are view-tuned: They fire most strongly to a particular view of the target object, as expected on the multiple-views proposal (Peissig & Tarr, 2007).
However, there has been lively debate between advocates of the RBC approach (with its claim that recognition is largely viewpoint-independent) and the multiple-views approach (with its argument that recognition is viewpoint-dependent). And this may be a case in which both sides are right-with some brain tissue being sensitive to viewpoint, and some brain tissue not being sensitive (see Figure 4.14). Moreover, the perceiver’s task may be crucial. Some neuroscience data suggest that categorization tasks (“Is this a cup?”) may rely on viewpoint-independent processing in the brain, while identification tasks (“Is this the cup I showed you before?”) may rely on viewpoint-dependent processing (Milivojevic, 2012). In addition, other approaches to object recognition are being explored (e.g., Hayward, 2012; Hummel, 2013; Peissig & Tarr, 2007; Ullman, 2007).
Obviously, there is disagreement in this domain. Even so, let’s be clear that all of the available proposals involve the sort of hierarchical network we’ve been discussing. In other words, no matter how the debate about object recognition turns out, it looks like we’re going to need a network model along the lines we’ve considered.
Demonstration 4.4: Face-Recognition Ability
How good are you at remembering and recognizing faces? The chapter mentions that people differ in their ability to remember faces.
If you’d like to know how your performance compares to that of others, go to this web address for the Cambridge Face Memory Test:
Or you can download the materials for the Glasgow Face Matching Test here:
Either test will tell you how you compare to people in general. For what it’s worth, though, the so-called “super-recognizers” (some of them employed by London’s Metropolitan Police) get scores close to 100% accuracy on the Glasgow test!
Face Recognition
We began our discussion of network models with a focus on how people recognize letters and words. We’ve now extended our reach and considered how a network might support the recognition of three-dimensional objects. But there’s one type of recognition that seems to demand a different approach: the recognition of faces.
Faces Are Special
As we described at the start of this chapter, damage to the visual system can produce a disorder known as agnosia-an inability to recognize certain stimuli-and one type of agnosia specifically involves the perception of faces. People who suffer from prosopagnosia generally have normal vision. Indeed, they can look at a photograph and correctly say whether the photo shows a face or something else; they can generally say whether a face is a man’s or a woman’s, and whether it belongs to someone young or someone old. But they can’t recognize individual faces-not even of their own parents or children, whether from photographs or “live.” They can’t recognize the faces of famous performers or politicians. In fact, they can’t recognize themselves (and so they sometimes think they’re looking through a window at a stranger when they’re actually looking at themselves in a mirror).
Often, this condition is the result of brain damage, but in some people it appears to be present from birth, without any detectable brain damage (e.g., Duchaine & Nakayama, 2006). Whatever its origin, prosopagnosia seems to imply the existence of special neural structures involved almost exclusively in the recognition and discrimination of faces. Presumably, prosopagnosia results from some problem or limitation in the functioning of this brain tissue.
(See Behrmann & Avidan, 2005; Burton, Young, Bruce, Johnston, & Ellis, 1991; Busigny, Graf, Mayer, & Rossion, 2010; Damasio, Tranel, & Damasio, 1990; De Renzi, Faglioni, Grossi, & Nichelli, 1991. For a related condition, involving an inability to recognize voices, see Shilowich & Biederman, 2016.)
The special nature of face recognition is also suggested by a pattern that is the opposite of prosopagnosia. Some people seem to be “super-recognizers” and are magnificently accurate in face recognition, even though they have no special advantage in other perceptual or memory tasks (e.g., Bobak, Hancock, & Bate, 2015; Davis, Lander, Evans, & Jansari, 2016; Russell, Duchaine, & Nakayama, 2009; Tree, Horry, Riley, & Wilmer, 2017). These people are consistently able to remember (and recognize) faces that they viewed only briefly at some distant point in the past, and they’re also more successful in tasks that require “face matching”-that is, judging whether two different views of a face actually show the same person.
There are certainly advantages to being a super-recognizer, but also some disadvantages. On the plus side, being able to remember faces is obviously a benefit for a politician or a sales person; super-recognizers also seem to be much more accurate as eyewitnesses (e.g., in selecting a culprit from a police lineup). In fact, London’s police force now has a special unit of super-recognizers involved in many aspects of crime investigation (Keefe, 2016). On the downside, being a super-recognizer can produce some social awkwardness. Imagine approaching someone and cheerfully announcing, “I know you! You used to work at the grocery store on Main Street.” The other person (who, let’s say, did work in that grocery eight years earlier) might find this puzzling, perhaps creepy, and maybe alarming.
What about the rest of us, people who are neither prosopagnosic nor super-recognizers? It turns out that people differ widely in their ability to remember and recognize faces (Bindemann, Brown, Koyas, & Russ, 2012; DeGutis, Wilmer, Mercado, & Cohan, 2013; Wilmer, 2017). These differences, from person to person, are easy to measure, and there are online face memory tests that can help you find out whether you’re someone who has trouble recognizing faces. (If you’re curious, point your browser at the Cambridge Face Memory Test.)
In all people, though, face recognition seems to involve processes different from those used for other forms of recognition. For example, we’ve mentioned the debate about whether recognition of houses, or teacups, or automobiles is viewpoint-dependent. There is no question about this issue, however, when we’re considering faces: Face recognition is strongly dependent on orientation, and so it shows a powerful inversion effect. In one study, four categories of stimuli were considered: right-side-up faces, upside-down faces, right-side-up pictures of common objects other than faces, and upside-down pictures of common objects. As Figure 4.15 shows, performance suffered for all of the upside-down (i.e., inverted) stimuli. However, this effect was much larger for faces than for other kinds of stimuli (Bruyer, 2001; Yin, 1969). Moreover, with non-faces, the (relatively small) effect of inversion becomes even smaller with practice; with faces, the effect of inversion remains in place even after practice (McKone, Kanwisher, & Duchaine, 2007).
The role of orientation in face recognition can also be illustrated informally. Figure 4.16 shows two upside-down photographs of former British prime minister Margaret Thatcher (from Thompson, 1980). You can probably tell that something is odd about them, but now try turning the book upside down so that the faces are right side up.
As you can see, the difference between the faces is striking, and yet this fiendish contrast is largely lost when the faces are upside down. (Also see Rhodes, Brake, & Atkinson, 1993; Valentine, 1988.) Plainly, then, face recognition is strongly dependent on orientation in ways that other forms of object recognition are not.
Once again, though, we need to acknowledge an ongoing debate. According to some authors, the recognition of faces really is in a category by itself, distinct from all other forms of recognition (e.g., Kanwisher, McDermott, & Chun, 1997). Other authors, however, offer a different perspective: They agree that face recognition is special but argue that certain other types of recognition, in addition to faces, are special in the same way. As one line of evidence, they argue that prosopagnosia isn’t just a disorder of face recognition. In one case, for example, a prosopagnosic bird-watcher lost not only the ability to recognize faces but also the ability to distinguish the different types of warblers (Bornstein, 1963; Bornstein, Sroka, & Munitz, 1969). Another patient with prosopagnosia lost the ability to tell cars apart; she can locate her car in a parking lot only by reading all the license plates until she finds her own (Damasio, Damasio, & Van Hoesen, 1982).
Likewise, in Chapter 2, we mentioned neuroimaging data showing that a particular brain site-the fusiform face area (FFA)-is specifically responsive to faces. (See, e.g., Kanwisher & Yovel, 2006. For a description of other brain areas involved in face recognition, see Gainotti & Marra, 2011.) One study, however, suggests that tasks requiring subtle distinctions among birds, or among cars, can also produce high levels of activation in this brain area (Gauthier, Skudlarski, Gore, & Anderson, 2000; also Bukach, Gauthier, & Tarr, 2006). This finding suggests that the neural tissue “specialized” for faces isn’t used only for faces.
(For more on this debate, see, on the one side, Grill-Spector, Knouf, & Kanwisher, 2004; McKone et al., 2007; Weiner & Grill-Spector, 2013. On the other side, see McGugin, Gatenby, Gore, & Gauthier, 2012; Richler & Gauthier, 2014; Stein, Reeder, & Peelen, 2016; Wallis, 2013; Zhao, Bülthoff, & Bülthoff, 2016.)
What should we make of all this? There’s no question that humans have a specialized recognition system that’s crucial for face recognition. This system certainly involves the FFA in the brain, and damage to this system can cause prosopagnosia. What’s controversial is how exactly we should describe this system. According to some authors, the system is truly a face recognition system and will be used for other stimuli only if those stimuli happen to be “face-like” (see Kanwisher & Yovel, 2006). According to other authors, this specialized system needs to be defined more broadly: It is used whenever you are trying to recognize specific individuals within a highly familiar category (e.g., Gauthier et al., 2000). The recognition of faces certainly has these traits (e.g., you distinguish Fred from George from Jacob within the familiar category of “faces”), but other forms of recognition may have the same traits (e.g., if a bird-watcher is distinguishing different types within the familiar category of “warblers”).
So far, the data don’t provide a clear resolution of this debate; both sides of the argument have powerful evidence supporting their view. But let’s focus on the key point of agreement: Face recognition is achieved by a process that’s different from the process described earlier in this chapter. We need to ask, therefore, how face recognition proceeds.
Holistic Recognition
The networks we’ve been considering so far all begin with an analysis of a pattern’s parts (e.g., features, geons); the networks then assemble those parts into larger wholes. Face recognition, in contrast, seems not to depend on an inventory of a face’s parts; instead, this process seems to depend on holistic perception of the face.
In other words, face recognition depends on the face’s overall configuration-the spacing of the eyes relative to the length of the nose, the height of the forehead relative to the width of the face, and so on. (For more on face recognition, see Bruce & Young, 1986; Duchaine & Nakayama, 2006; Hayward, Crookes, Chu, Favelle, & Rhodes, 2016.)
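To make the idea of "configuration" concrete, here is a minimal sketch that describes a face by relationships among measurements rather than by the measurements themselves. Everything in it is an assumption made for illustration: the function name `configural_ratios`, the four measurements, and the numbers are hypothetical, and the sketch is not a model of how the brain encodes faces.

```python
import math


def configural_ratios(eye_spacing, nose_length, forehead_height, face_width):
    """Describe a face by relationships between measurements.

    All four arguments are hypothetical measurements in the same unit
    (e.g., centimeters). The two ratios mirror the examples in the text.
    """
    return {
        "eye_spacing/nose_length": eye_spacing / nose_length,
        "forehead_height/face_width": forehead_height / face_width,
    }


def same_configuration(face_a, face_b, tolerance=1e-9):
    """True if two ratio descriptions match within a small tolerance."""
    return all(
        math.isclose(face_a[key], face_b[key], rel_tol=tolerance)
        for key in face_a
    )


if __name__ == "__main__":
    # The same (made-up) face seen at two sizes: every raw measurement
    # differs, but the relationships among them are unchanged.
    near = configural_ratios(6.0, 5.0, 6.0, 14.0)
    far = configural_ratios(3.0, 2.5, 3.0, 7.0)
    # A different (made-up) face that shares the same nose length.
    other = configural_ratios(7.0, 5.0, 5.0, 14.0)

    print(same_configuration(near, far))
    print(same_configuration(near, other))
```

In this toy description, no single measurement identifies the face; only the pattern of ratios does. That is the sense in which the features matter because of the relationships they create.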
Of course, a face’s features still matter in this holistic process. The key, however, is that the features can’t be considered one by one, apart from the context of the face. Instead, the features matter because of the relationships they create. It’s the relationships, not the features on their own, that guide face recognition. (See Fitousi, 2013; Rakover, 2013; Rhodes, 2012; Wang, Li, Fang, Tian, & Liu, 2012, but also see Richler & Gauthier, 2014. For more on holistic perception of facial movement, see Zhao & Bülthoff, 2017.)
Some of the evidence for this holistic processing comes from the composite effect in face recognition. In an early demonstration of this effect, Young, Hellawell, and Hay (1987) combined the top half of one face with the bottom half of another, and participants were asked to identify just the top half. This task is difficult if the two halves are properly aligned. In this setting, participants seemed unable to focus only on the top half; instead, they saw the top of the face as part of the whole (see Figure 4.17A). Thus, in the figure, it’s difficult to see that the top half of the face is Hugh Jackman (shown in normal view in Figure 4.17C). This task is relatively easy, though, if the halves are misaligned (as in Figure 4.17B). Now, the stimulus itself breaks up the configuration, making it possible to view the top half on its own. (For related results, see Amishav & Kimchi, 2010; but also see Murphy, Gray, & Cook, 2017. For evidence that the strength of holistic processing is predictive of face-recognition accuracy, see Richler, Cheung, & Gauthier, 2011. For a complication, though, see Rezlescu, Susilo, Wilmer, & Caramazza, 2017.)
More work is needed to specify how the brain detects and interprets the relationships that define each face.
Also, our theorizing will need to take some complications into account-including the fact that the recognition processes used for familiar faces may be different from the processes used for faces you’ve seen only once or twice (Burton, Jenkins, & Schweinberger, 2011; Burton, Schweinberger, Jenkins, & Kaufmann, 2015; Young & Burton, 2017). Evidence suggests that in recognizing familiar faces, you rely more heavily on the relationships among the internal features of the face; for unfamiliar faces, you may be more influenced by the face’s outer parts, such as the hair and the overall shape of the head (Campbell et al., 1999).
Moreover, psychologists have known for years that people are more accurate in recognizing faces of people from their own racial background (e.g., Caucasians looking at other Caucasians, or Asians looking at other Asians) than they are when trying to recognize people of other races (e.g., Meissner & Brigham, 2001). In fact, some people seem entirely prosopagnosic when viewing faces of people from other groups, even though they have no difficulty recognizing faces of people from their own group (Wan et al., 2017). These points may suggest that people rely on different mechanisms for, say, “same-race” and “cross-race” face perception, and this point, too, must be accommodated in our theorizing. (For recent discussions, see Horry, Cheong, & Brewer, 2015; Wan, Crookes, Reynolds, Irons, & McKone, 2015.)
Obviously, there is still work to do in explaining how we recognize our friends and family-not to mention how we manage to remember and recognize someone we’ve seen only once before. We know that face recognition relies on processes different from those discussed earlier in the chapter, and we know that these processes rely on the configuration of the face, rather than its individual features. More research is needed, though, to fill in the details of this holistic processing. (For examples of other research on memory for faces, see Jones & Bartlett, 2009; Kanwisher, 2006; Michel, Rossion, Han, Chung, & Caldara, 2006; Rhodes, 2012. For discussion of how these issues play out in the justice system, with evidence coming from eyewitness identifications, see Reisberg, 2014.)
Top-Down Influences on Object Recognition
We’ve now discussed one important limitation of feature nets. These nets can, as we’ve seen, accomplish a great deal, and they’re crucial for the recognition of print, three-dimensional objects in the visual environment, and probably sounds as well.
But there are some targets-faces, and perhaps others-for which recognition depends on configurations rather than individual features.
It turns out, though, that there is another limit on feature nets, even if we’re focusing on the targets for which a feature net is useful-print, common objects, and so on. Even in this domain, feature nets must be supplemented with additional mechanisms. This requirement doesn’t undermine the importance of the feature net idea; feature nets are definitely needed as part of our theoretical account. The key word, however, is “part,” because we need to place feature nets within a larger theoretical frame.
The Benefits of Larger Contexts
Earlier in the chapter, we saw that letter recognition is improved by context. For example, the letter V is easier to recognize in the context “VASE,” or even the nonsense context “VIMP,” than it is if presented alone. These are examples of “top-down” effects-effects driven by your knowledge and expectations. And these particular top-down effects, based on spelling patterns, are easily accommodated by the network: As we have discussed, priming (from recency and frequency of use) guarantees that detectors that have often been used in the past will be easier to activate in the future. In this way, the network “learns” which patterns are common and which are not, and it is more receptive to inputs that follow the usual patterns.
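The priming account just described can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the network from the chapter: the `Detector` class, the `train` helper, the numeric parameters, and the example letter pairs "TH" and "HT" are all hypothetical, and the sketch models only frequency of use, not recency.

```python
from dataclasses import dataclass


@dataclass
class Detector:
    """A toy detector with a fixed firing threshold (all values hypothetical)."""
    name: str
    baseline: float = 0.0   # resting activation level
    threshold: float = 1.0  # activation needed for the detector to fire
    boost: float = 0.1      # how much each firing raises the baseline
    ceiling: float = 0.8    # cap, so the baseline alone never causes firing

    def input_needed(self) -> float:
        """Input strength required to fire from the current baseline."""
        return self.threshold - self.baseline

    def present(self, strength: float) -> bool:
        """Present an input; if the detector fires, prime it for next time."""
        fired = self.baseline + strength >= self.threshold
        if fired:
            self.baseline = min(self.ceiling, self.baseline + self.boost)
        return fired


def train(detector: Detector, exposures: int, strength: float = 1.0) -> None:
    """Expose a detector to a clear, full-strength input several times."""
    for _ in range(exposures):
        detector.present(strength)


if __name__ == "__main__":
    common = Detector("TH")  # stands in for a frequent letter pair
    rare = Detector("HT")    # stands in for an infrequent letter pair
    train(common, exposures=5)
    train(rare, exposures=1)

    weak_input = 0.6  # a brief or degraded presentation
    print(common.name, "fires on weak input:", common.present(weak_input))
    print(rare.name, "fires on weak input:", rare.present(weak_input))
```

With these made-up numbers, the frequently used detector fires on the weak input while the rarely used one does not. The toy network is "more receptive to inputs that follow the usual patterns" only in that limited sense.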
Other top-down effects, however, require a different type of explanation. Consider the fact that words are easier to recognize if you see them as part of a sentence than if you see them in isolation. There have been many formal demonstrations of this effect (e.g., Rueckl & Oden, 1986; Spellman, Holyoak, & Morrison, 2001; Tulving & Gold, 1963; Tulving, Mandler, & Baumal, 1964), but for our purposes an informal example will work. Imagine that we tell research participants, “I’m about to show you a word very briefly on a computer screen; the word is the name of something that you can eat.” If we forced the participants to guess the word at this point, they would be unlikely to name the target word. (There are, after all, many things you can eat, so the chances are slim of guessing just the right one.) But if we briefly show the word “CELERY,” we’re likely to observe a large priming effect; that is, participants are more likely to recognize “CELERY” with this cue than they would have been without the cue.
Think about what this priming involves. First, the person needs to understand each of the words in the instruction. If she didn’t understand the word “eat” (e.g., if she mistakenly thought we had said, “something that you can beat”), we wouldn’t get the priming. Second, the person must understand the relations among the words in the instruction. For example, if she mistakenly thought we had said, “something that can eat you,” we would expect a very different sort of priming. Third, the person has to know some facts about the world-namely, the kinds of things that can be eaten; without this knowledge, we would expect no priming.