Introduction

Famous musical melodies, such as “Row, Row, Row Your Boat” and “Rudolph the Red-Nosed Reindeer,” are frequently used in psychological research. For example, famous melodies have been used to assess the degree of cognitive impairment in various neurological disorders, such as Alzheimer’s disease. A sizeable body of work indicates that individuals with Alzheimer’s disease have a preserved ability to identify melodies, despite memory deficits for other categories of stimuli (Cuddy & Duffin, 2005; Cuddy et al., 2012; Cuddy, Sikka, & Vanstone, 2015; Hsieh, Hornberger, Piguet, & Hodges, 2011; Vanstone et al., 2012). In contrast, other work indicates that individuals with semantic dementia show a deficit in naming famous melodies (Hsieh et al., 2011; Johnson et al., 2011), although evidence of impaired melody recognition in semantic dementia has been mixed (Hailstone, Omar, & Warren, 2009; Omar, Hailstone, Warren, Crutch, & Warren, 2010). Overall, famous musical melodies have proved to be useful stimuli for identifying cognitive functions that may be selectively preserved (or impaired) in various dementias (for review, see Omar, Hailstone, & Warren, 2012).

Other neuropsychological studies have used famous melodies to investigate lexical and conceptual retrieval in patients with focal brain damage. For example, prior work indicates that patients with left temporal polar damage are impaired at naming famous persons, famous landmarks, and famous melodies, suggesting that the left temporal pole is a “heteromodal” convergence region for semantically unique items (Belfi, Kasdan, & Tranel, 2019; Belfi & Tranel, 2014; Schneider, Heskje, Bruss, Tranel, & Belfi, 2018). Famous melodies have also been used to identify apperceptive agnosia for music in patients with temporal lobe lesions (Ayotte, Peretz, Rousseau, Bard, & Bojanowski, 2000; Baird, Walker, Biggs, & Robinson, 2014). Such melodies have been used to investigate memory and language deficits (or a lack thereof) in other populations, including individuals with primary progressive aphasia (Macoir et al., 2016), amnesia due to herpes simplex encephalitis (Finke, Esfahani, & Ploner, 2012), and cochlear implant users (Olszewski, Gfeller, Froman, Stordahl, & Tomblin, 2005; Stordahl, 2002; Volkova, Trehub, Schellenberg, Papsin, & Gordon, 2014).

While famous melodies have clear utility for neuropsychological research, they are also used more generally to investigate perception and cognition in healthy populations. Famous melodies have been used to investigate differences between the perception and imagination of music (Herholz, Halpern, & Zatorre, 2012), and the distinction between recognition and naming, or ‘tip-of-the-tongue’ phenomena (Kostic & Cleary, 2009). Famous melodies are also a useful stimulus for investigating the timing of melody recognition (Bailes, 2010). For example, prior work indicates that individuals can recognize musical melodies within 500 ms or a few notes (Büdenbender & Kreutz, 2016; Dalla Bella, Peretz, & Aronoff, 2003; Filipic, Tillmann, & Bigand, 2010; Huijgen et al., 2015; Tillmann, Albouy, Caclin, & Bigand, 2014). Familiar melodies are also a frequently used stimulus for evoking autobiographical memories (Belfi, Karlan, & Tranel, 2016; Ford, Addis, & Giovanello, 2011; Janata, Tomic, & Rakowski, 2007).

While familiar melodies are used in a wide variety of psychological studies, there is currently no set of famous melodies that has been validated in a United States sample of participants. Various “famous” or “familiar” melody tests have been previously reported in the literature, but these stimulus sets were created idiosyncratically to match the knowledge and musical preferences of individual case studies (Steinke, Cuddy, & Jakobson, 2001), contained only a small number of melodies (Liégeois-Chauvel, Peretz, Babaï, Laguitton, & Chauvel, 1998), and/or lacked associated normative data (Hsieh et al., 2011; Kostic & Cleary, 2009). Furthermore, several previously used sets of famous melodies were developed outside the United States. Although some melodies are likely familiar to all Western musical listeners, such stimulus sets may not be entirely appropriate for a US-based participant group (for example, a melody set published in French; Peretz, Babai, Lussier, Hebert, & Gagnon, 1995).

While there are no standardized sets of famous musical melodies, there are similar stimulus sets focusing on other features of music besides familiarity. For example, there are multiple sets of melodies chosen or composed to represent different emotions (e.g., a “happy” or a “sad” melody, or melodies with a positive or negative valence). There are numerous sets of such emotional melodies, consisting of either sung or instrumental music (Eschrich, Münte, & Altenmüller, 2008; Koelsch et al., 2013; Lepping, Atchley, & Savage, 2016; Livingstone & Russo, 2018; Rainsford, Palmer, & Paine, 2018; Vieillard et al., 2008). There are also sets of musical stimuli used to study musical expectancy, which often consist of series of chord progressions (Koelsch, Gunter, Wittfoth, & Sammler, 2005). Outside the realm of music, normed stimulus sets are quite common – for example, there are stimulus sets of visual stimuli representing different emotions (Kurdi, Lozano, & Banaji, 2017; Lang, Bradley, & Cuthbert, 2008), everyday objects (Snodgrass & Vanderwart, 1980; Tranel, Logan, Frank, & Damasio, 1997) or unique objects (Tranel, Enekwechi, & Manzel, 2005); sets of words in English (De Deyne, Navarro, Perfors, Brysbaert, & Storms, 2018; Scott, Keitel, Becirspahic, Yao, & Sereno, 2018) and other languages (Chedid et al., 2018); and sets of voices (Darcy & Fontaine, 2019; Zäske, Skuk, Golle, & Schweinberger, 2019).

In sum, while prior research has often relied on famous musical melodies to investigate various cognitive and perceptual processes, and while normed stimulus sets are frequently used in other sensory domains, a set of standardized famous melodies has not yet been developed. The goal of the present work was to design such a set of famous musical melodies. Some prior work using famous melodies has taken the approach of using ‘naturalistic’ excerpts – for example, work investigating music as an autobiographical memory cue has used excerpts of popular music from the Billboard charts (e.g., Belfi et al., 2016; Janata et al., 2007), and work investigating memory for emotional music has used film scores (Eschrich et al., 2008). In contrast to this approach, here we sought to create stimuli that isolated the melodies themselves. Therefore, melodies in the present stimulus set consist of a single-line melody with no harmonic accompaniment and no lyrical content. In contrast to other stimulus sets of unfamiliar (or merely recognizable) melodies, the present set was designed to contain musical melodies that were highly likely to be both recognized and named. As the primary goal was to create a set of highly familiar melodies, participants rated melodies on their familiarity and provided a written response with the melody name. Ratings of age of acquisition, emotional valence, and emotional arousal were also collected. Below we first describe the development of the stimulus set and selection of melodies, followed by the collection of normative data and characterization of the melodies.

Stimulus development

First, we sought to obtain a broad array of famous melodies. Our initial stimulus set consisted of 52 melodies used in our previous work (Belfi & Tranel, 2014). These stimuli consist of familiar melodies such as “Happy Birthday” and “Row, Row, Row Your Boat” (see Belfi, Kasdan, & Tranel (2019) for the full list of the original 52 melodies). While this initial stimulus set contained a large number of famous melodies, we sought to capture any additional melodies that may have been missed in our prior study. To this end, we conducted an initial survey soliciting further melodies to add to the stimulus set.

Methods

Participants

This initial survey was conducted using Amazon’s Mechanical Turk (AMT). We restricted participation to workers in the United States who had completed at least 1000 previous AMT tasks and obtained approval ratings of at least 95%. Other inclusion criteria were that participants must be native English speakers and older than 18 years. A total of 108 participants completed the task. Data from eight participants were excluded because they failed to provide appropriate answers to the questionnaire (for example, writing nonsense words), leaving a total of 100 participants. Of these 100 participants, 61 were men and 38 were women (one participant preferred not to answer). Participants ranged in age from 20 to 68 years old (M = 35.56, SD = 11.48). The task took approximately 5 min and participants were compensated $0.75 for completing the survey.

Procedure

Participants were given the previously published list of 52 melodies (from Belfi & Tranel, 2014) and were asked to suggest five melodies that were not present on the list but that would fit the criteria of being famous melodies that are highly recognizable to a US audience. Participants saw five empty text boxes where they were to type the names of the five additional melodies.

Selection of melodies for final stimulus set

Our method of soliciting additions to the melody list could have resulted in up to 500 unique responses (100 participants each listing five melodies). We first removed responses that were clearly not melodies (e.g., blanks, responses such as “nice,” nonwords, and punctuation marks). Next, we identified melodies that were already included in the stimulus set (e.g., several participants wrote “Somewhere Over the Rainbow” when “Over the Rainbow” was already included). Other responses were unique names but nonunique melodies (e.g., the “ABC Song” is the same melody as “Twinkle, Twinkle Little Star”). Once these redundant melodies were eliminated, we categorized the remaining melodies into eight categories or “genres”: children’s, patriotic, movie/TV, Christmas, religious, classical, pop, or “other” for melodies that did not fall into one of the above categories. Our goal was a final stimulus set roughly equally distributed across these categories (for example, we did not want to select an overabundance of children’s melodies). We also wanted to avoid choosing too many melodies by the same musical artist (for example, “Beat It,” “Thriller,” and “Billie Jean” by Michael Jackson were all commonly named, as were “I Want to Hold Your Hand,” “Yellow Submarine,” and other songs by The Beatles), so we limited the selection to three melodies per artist. Melodies named by more than one participant were more likely to be selected than melodies named only once, although several melodies that were named only once were nevertheless selected.
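The normalization and de-duplication steps above can be sketched as follows. This is an illustrative Python sketch, not the authors' actual procedure (which was performed by hand); the alias map, helper names, and example inputs are hypothetical, drawn from the examples in the text.

```python
import re
from collections import Counter

# Hypothetical alias map: distinct names that denote the same melody,
# per the examples given in the text.
ALIASES = {
    "somewhere over the rainbow": "over the rainbow",
    "abc song": "twinkle, twinkle little star",
}

def normalize(name):
    """Lowercase, strip punctuation, collapse whitespace, map known aliases."""
    cleaned = " ".join(re.sub(r"[^\w\s,]", "", name).lower().split())
    return ALIASES.get(cleaned, cleaned)

# Hypothetical free-text suggestions from the survey
suggestions = ["Somewhere Over the Rainbow", "ABC Song", "Beat It", "beat it!"]
counts = Counter(normalize(s) for s in suggestions if s.strip())

# Melodies named by more than one participant were stronger candidates
frequent = [name for name, n in counts.items() if n > 1]
```

After this automated collapse, melodies would still need the manual category assignment and per-artist cap described above.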

This selection process resulted in a total of 107 melodies comprising the final stimulus set (see the Appendix for the final list of melodies and their categorization). The 107 melodies in the final stimulus set contained the following number of melodies in each category: children’s (n = 20), patriotic (n = 11), movie/TV (n = 16), Christmas (n = 17), religious (n = 6), classical (n = 8), pop (n = 18), and other (n = 11).

Melody construction

Once the final list of melodies was determined, the next step was to create the audio files to use in the normative data collection phase. Melodies were constructed following our previously published procedures (Belfi & Tranel, 2014). Briefly, the software MuseScore (musescore.org) was used to create each melody as a single-line melody with no harmonic accompaniment, rendered in a MIDI piano timbre. Each melody consisted of roughly one to two musical phrases and was an average of 13.31 s long (SD = 5.35, range 6–37 s). This relatively large range in stimulus length is common in work on musical melodies (Larrouy-Maestri, Harrison, & Müllensiefen, 2019), as our goal was to maintain a similar amount of musical information per melody, rather than a fixed absolute duration.

Normative data collection

We next sought to collect normative data on variables typically reported for other stimulus sets of melodies, images, and other categories of stimuli. To this end, we collected ratings on two emotional dimensions that have been frequently used in prior research: valence and arousal. As it has been hypothesized that melodies have a later age of acquisition than other categories of stimuli (Belfi et al., 2019), we also collected ratings of age of acquisition. Given that the focus of this stimulus set is highly familiar melodies, we collected ratings of familiarity. Finally, we sought to identify the “nameability” of each melody; therefore, we also asked participants to name each melody. To summarize, we collected normative data on the following variables: valence, arousal, familiarity, age of acquisition, and naming.

Methods

Participants

Our goal was to create a normed stimulus set of famous melodies, similar to other normed stimulus sets of emotional music (Livingstone & Russo, 2018; Vieillard et al., 2008). We sought to collect normative data from a wide range of participants, so that the stimuli would be suitable for use in a variety of research contexts. We collected data from two separate populations: undergraduate students at Missouri S&T who completed the study for research credits, and participants from Amazon’s Mechanical Turk (AMT) who completed the study for monetary compensation. A secondary goal of the present work was to compare the results obtained from the undergraduate vs. the AMT groups, to identify possible differences between these two populations.

In determining our target number of participants, we first looked at prior work developing musical stimulus sets. For example, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS; Livingstone & Russo, 2018) is a stimulus set of voice recordings expressing various emotions in speech and song. For the RAVDESS database, each stimulus was rated ten times (i.e., by ten individual participants). Another frequently used set of emotional musical stimuli obtained ratings from 20 participants for each stimulus (Vieillard et al., 2008). Given our prior work indicating fairly large individual differences in ratings of musical stimuli (Belfi, 2019), we sought to be more exhaustive in the number of participants for the present stimulus set. Therefore, to be as thorough as possible, our target was to have each stimulus in the Famous Melodies Stimulus Set rated by at least 20 to 30 participants from each population (AMT and undergraduate), for a total of at least 50 individual ratings per stimulus.

Undergraduate participants

A total of 206 undergraduate participants completed the task. The task was administered online to participants in the Psychology Department Subject Pool at Missouri S&T, who completed the task for course credit. Participants were excluded if they failed to pass a “foil” question, which asked them to provide a specific answer (e.g., participants heard a voice saying “Please type the word ‘banana’ on the following screen”). A total of 22 participants failed to pass this foil and were therefore excluded, resulting in a total N = 184 participants. This rate of exclusion (10.6%) is similar to that found in prior research using online participants (Meade & Craig, 2012). Our final set of undergraduate participants consisted of 132 men, 50 women, and 2 nonbinary individuals. The average age was 19.74 (SD = 1.61) and participants had, on average, 3.17 years of formal musical training (SD = 3.55). To alleviate participant fatigue and to prevent participants from making similar ratings across scales, each participant rated all melodies on a single variable (participants were randomly assigned to variables). This resulted in the following numbers of participants for each variable: valence (n = 34), arousal (n = 38), familiarity (n = 43), age of acquisition (n = 34), and naming (n = 35).

Amazon Mechanical Turk participants

A total of 191 AMT participants completed the task. We restricted participation to workers in the United States who had completed at least 1000 previous AMT tasks and obtained approval ratings of at least 95%. The task took approximately 30 min to complete and participants were compensated $3 for their time. A total of 37 participants failed the foil question, leaving a total of N = 154 participants. This rate of exclusion (19%) is slightly higher than that from the undergraduate sample, but not entirely unusual for online studies using AMT workers. The final set of AMT participants consisted of 89 men, 64 women, and one genderqueer individual. The average age was 38.48 (SD = 11.90) and participants had, on average, 2.33 years of formal musical training (SD = 3.62). As with the undergraduate group, to alleviate participant fatigue and to prevent participants from making similar ratings across scales, each participant rated all melodies on a single variable (participants were randomly assigned to variables). This resulted in the following numbers of participants for each variable: valence (n = 24), arousal (n = 31), familiarity (n = 44), age of acquisition (n = 28), and naming (n = 27).

The two groups (AMT and undergraduate) significantly differed in age and years of musical training. The AMT sample was significantly older than the undergraduate sample, t(335) = 21.07, p < 0.001, 95% CI: [16.98, 20.48], and the undergraduate sample had significantly more years of musical training, t(336) = –2.03, p = 0.04, 95% CI: [–1.65, –0.02]. The proportions of men and women in the two samples were also significantly different, X2 = 7.63, p = 0.005: there was a greater proportion of women in the AMT sample (41.8%) than in the undergraduate sample (27.6%).

Procedure

The experimental task was created using jsPsych (de Leeuw, 2015) and implemented using psiTurk (Gureckis et al., 2016). Upon providing informed consent, participants were randomly assigned to one of five ratings: valence, arousal, familiarity, age of acquisition, or naming. Participants completed only one of the five ratings across all stimuli, to minimize participant fatigue and allow for independence between ratings. The specific rating scales are as follows. Valence, arousal, familiarity, and age of acquisition were rated on Likert scales. For valence, participants were asked “How negative or positive is this melody?” and responded on the following scale: “very negative,” “somewhat negative,” “neither negative nor positive,” “somewhat positive,” “very positive.” For arousal, participants were asked “How relaxing or stimulating is this melody?” and responded on the following scale: “very relaxing,” “somewhat relaxing,” “neither relaxing nor stimulating,” “somewhat stimulating,” “very stimulating.” For familiarity, participants were asked “How familiar is this melody?” and responded on the following scale: “not at all familiar,” “slightly familiar,” “somewhat familiar,” “moderately familiar,” “very familiar.” For age of acquisition, participants were asked “Estimate the age at which you first learned this melody” and responded on the following scale: “never,” “age 0–2,” “age 3–4,” “age 5–6,” “age 7–8,” “age 9–10,” “age 11–12,” “age 13+”. For naming, participants were asked “What is the name of this melody?” and responded by typing the name into a blank text box. After rating all 107 melodies, participants completed a brief demographics questionnaire, which included age, gender, and years of musical training.

Data quantification

All rating scales were converted to numerical values for analysis, starting with a 0 for the item at the leftmost end of the scale. For age of acquisition, the response “never” was reverse-coded as an 8 (to denote a “later” age of acquisition). We conducted all analyses a second time, removing all “never” trials for age of acquisition, and this did not substantially change our results (the analyses described below include the “never” trials). For naming trials, the experimenters read through all typed responses and manually scored them as correct or incorrect, as in prior research (Belfi & Tranel, 2014). Briefly, if the participant’s response matched the correct name of the melody, it was scored a 1 for correct. If the response did not match the name of the melody, it was scored a 0 for incorrect. If a participant provided an alternate but correct name, these instances were also scored as correct (for example, “What Child is This” for “Greensleeves”).
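The coding scheme above can be sketched as follows. This is a minimal Python sketch: the scale labels are taken from the Procedure section (ASCII-normalized), and the helper names are our own, not the authors' code.

```python
# Scale labels assumed from the Procedure section (hyphens instead of en dashes)
VALENCE_SCALE = ["very negative", "somewhat negative",
                 "neither negative nor positive",
                 "somewhat positive", "very positive"]
AOA_SCALE = ["never", "age 0-2", "age 3-4", "age 5-6",
             "age 7-8", "age 9-10", "age 11-12", "age 13+"]

def code_likert(response, scale):
    """Leftmost scale item -> 0, increasing to the right."""
    return scale.index(response)

def code_aoa(response):
    """'Never' is reverse-coded as 8 to denote a 'later' age of acquisition."""
    return 8 if response == "never" else AOA_SCALE.index(response)

def score_name(response, correct, alternates=()):
    """1 if the typed name matches the correct name or an accepted alternate."""
    r = response.strip().lower()
    return int(r == correct.lower() or r in {a.lower() for a in alternates})
```

Note that in practice the naming responses were scored manually, so a rule like `score_name` only approximates the human judgment of what counts as a match.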

Results

Normative ratings

The primary goal of this project was to create a normed set of famous melody stimuli. To this end, we calculated average ratings on each of the five variables: valence, arousal, familiarity, age of acquisition, and naming. Collapsed across both groups, the melody with the lowest valence was Scarborough Fair (M = 1.36), while the melodies with the highest valence were Star Wars Theme and The Entertainer (M = 3.44). The melody with the lowest arousal was Silent Night (M = 0.72) while the melody with the highest arousal was The Entertainer (M = 3.50). The melody with the lowest familiarity was This Little Light of Mine (M = 0.70) while the melodies with the highest familiarity were Here Comes the Bride and Jingle Bells (M = 3.90). The melody with the lowest age of acquisition was Old MacDonald (M = 2.30) while the melody with the highest age of acquisition was Sweet Caroline (M = 7.45). Finally, the melodies with the lowest naming percentage were YMCA and ABC (the Jackson Five song, not the "ABC Song" which shares its melody with Twinkle Twinkle Little Star) (0% correct), while the melody with the highest naming percentage was Happy Birthday (88.7% correct). See Appendix for a full list of melodies and the mean ratings on each of the five rating scales.

We also sought to investigate relationships among these variables by calculating Pearson’s correlations between each pair of variables. There were significant correlations between the following variables: valence and arousal (r = 0.70, t(105) = 10.04, p < 0.001, 95% CI: [0.58, 0.78]), valence and familiarity (r = 0.48, t(105) = 5.65, p < 0.001, 95% CI: [0.32, 0.61]), valence and age of acquisition (r = –0.52, t(105) = –6.10, p < 0.001, 95% CI: [–0.63, –0.35]), valence and naming (r = 0.42, t(105) = 4.78, p < 0.001, 95% CI: [0.25, 0.56]), naming and familiarity (r = 0.78, t(105) = 12.99, p < 0.001, 95% CI: [0.69, 0.84]), naming and age of acquisition (r = –0.77, t(105) = –12.63, p < 0.001, 95% CI: [–0.84, –0.68]), and familiarity and age of acquisition (r = –0.90, t(105) = –21.86, p < 0.001, 95% CI: [–0.93, –0.86]). See Figure 1 for a graphical depiction of these correlations.
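For readers who wish to reproduce this style of analysis, the quantities reported above (r, its t statistic on n − 2 degrees of freedom, and an approximate confidence interval via Fisher's z transform) can be computed as in the following sketch. This is an illustrative Python version, not the authors' code, and the function names are our own.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_to_t(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via Fisher's z transform."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

With n = 107 melodies, as here, t statistics carry 105 degrees of freedom, matching the values reported above.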

Fig. 1
figure 1

Correlations between the five measured variables. Lower panels depict scatterplots with a linear regression line superimposed. Gray band surrounding regression line indicates the 95% confidence interval. Diagonals depict histograms indicating the frequency of responses on each variable. Upper panels indicate the r values for the correlations

Cluster analysis

When creating the stimulus set, melodies were classified into one of eight categories: children’s, patriotic, movie/TV, Christmas, religious, classical, pop, or “other” for melodies that did not fall into one of the preceding categories. One question was whether these categories systematically differed in their ratings on the five variables of interest. To address this question, we conducted a k-means cluster analysis to determine whether melodies could be grouped on the basis of their normative ratings, and whether such groups would systematically map onto melody category.

While we had an a priori prediction that our stimulus set would group into eight clusters (based on the eight predetermined categories of children’s, patriotic, movie/TV, Christmas, religious, classical, pop, or “other”), we first sought to identify the optimal number of clusters based on our data. We first standardized the ratings by converting them from raw ratings to z-scores. Next, we conducted two analyses to identify the optimal number of clusters, using the NbClust function from the NbClust package (Charrad, Ghazzali, Boiteau, & Niknafs, 2014) and the fviz_nbclust function from the factoextra package in R. First, we identified the optimal number of clusters using the “elbow” method, which selects the number of clusters based on the total within-cluster sum of squares (WSS), a measure of variation within each cluster. This method suggested that two was the optimal number of clusters, as two clusters had lower WSS than one cluster, but adding a third cluster did not substantially reduce the WSS (see Supplementary Figure 1A, https://osf.io/wrqzm/). We also identified the optimal number of clusters using the silhouette method, which compares the similarity of (or distance between) objects within each cluster to the similarity of items between clusters (Kaufman & Rousseeuw, 2005). A higher silhouette value indicates better clustering, such that within-cluster similarity is maximized while between-cluster similarity is minimized. This method also suggested two as the optimal number of clusters (see Supplementary Figure 1B, https://osf.io/wrqzm/).
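The silhouette criterion can be illustrated on a toy one-dimensional example. The sketch below is in Python (the authors used factoextra in R) and assumes every cluster has at least two members; it is not the authors' code.

```python
def mean_silhouette(points, labels):
    """Average silhouette width for 1-D points under a cluster assignment.
    Higher values indicate tighter, better-separated clusters.
    Assumes every cluster has at least two members."""
    def dist(a, b):
        return abs(a - b)
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to other members of p's own cluster
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = sum(own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) /
            labels.count(other)
            for other in set(labels) if other != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)
```

A well-separated assignment yields a mean silhouette near 1, while a mismatched assignment yields a value near or below 0, which is the logic behind picking the k that maximizes this score.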

K-means clustering was performed using the kmeans function in R, with the number of clusters set to two based on the analyses above identifying two as the optimal number. There were 63 melodies in Cluster 1 and 44 melodies in Cluster 2. Cluster 1 consisted of melodies that were rated as more familiar, had a lower age of acquisition, and were more frequently named than melodies in Cluster 2. Melodies in Cluster 1 also had higher average ratings of valence and arousal, although these emotional variables were not as clearly different between the two clusters (Table 1). See Fig. 2 for a graphical depiction of these clusters.
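For illustration, the k = 2 clustering step itself can be sketched as a naive Lloyd's-algorithm loop over standardized rating vectors. This Python sketch stands in for R's kmeans(); the data points and the simplistic initialization are hypothetical.

```python
def kmeans_two(points, iters=25):
    """Naive k-means with k = 2 on tuples of ratings.
    Initialized at the first and last points, for illustration only."""
    cents = [points[0], points[-1]]
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        # Assignment step: each point joins its nearest centroid
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
            groups[d.index(min(d))].append(p)
        # Update step: centroids move to the mean of their members
        cents = [tuple(sum(vals) / len(g) for vals in zip(*g))
                 for g in groups]
    return cents, groups
```

Production implementations add multiple random restarts and empty-cluster handling, which this sketch omits.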

Table 1 Mean (SD) values for each variable for the two clusters
Fig. 2
figure 2

Cluster visualization. Clusters 1 and 2 are mapped onto a naming and familiarity, and b valence and arousal. To preserve readability, not all melody names are included in the figure

We also sought to investigate whether certain categories of melodies were more likely to fall into one cluster or the other: for example, whether Christmas melodies were more likely to fall in Cluster 1 than Cluster 2. We conducted a chi-square test to evaluate differences between clusters in the proportion of melodies in each category. This revealed a significant difference between clusters, X2 = 35.63, p < 0.001. Pairwise comparisons indicated that the only category with significantly more melodies in Cluster 2 than Cluster 1 was “pop”. That is, pop melodies were rated as less familiar, had a higher age of acquisition, and were less frequently named (and were therefore over-represented in Cluster 2). See Fig. 3 for a graphical depiction of the proportion of melodies in each cluster, and see the Appendix for the full list of melodies and their cluster designations.
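The test statistic for such a clusters × categories contingency table can be computed as in the following sketch (Python, illustrative only; the counts in the test are hypothetical, and the p-value lookup against the chi-square distribution is omitted).

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a rows x columns table of counts.
    Expected counts come from the products of the row and column margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

Here the table would have two rows (Cluster 1, Cluster 2) and eight columns (the melody categories), giving (2 − 1) × (8 − 1) = 7 degrees of freedom.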

Fig. 3
figure 3

Proportion of melodies in each category in Cluster 1 and 2. Cluster 1 is represented in red; Cluster 2 is represented in blue. All melody categories had a greater proportion of melodies in Cluster 1, except Pop and Religious melodies. Bar width represents total number of melodies in each category (e.g., there are more Children’s melodies in the stimulus set [n = 20] than religious melodies [n = 6]). Bar height represents the proportion of melodies in each cluster

Interrater reliability

We assessed interrater reliability across the five variables by calculating intraclass correlation coefficients (ICCs) using the “icc” function from the “irr” package in R. We calculated our ICCs using a two-way model with the “agreement” measure (McGraw & Wong, 1996). As in prior work on similar stimulus sets (Livingstone & Russo, 2018), we calculated ICCs based on both single and average ratings, and have presented these in Table 2. These results indicate poor to moderate agreement across the five variables for single measures, but very high agreement for all five variables for average measures.
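The two-way agreement ICCs for single and average ratings can be computed from the two-way ANOVA mean squares of an items × raters matrix, as in the sketch below. This is an illustrative Python version of the formulas we take to correspond to ICC(A,1) and ICC(A,k) in McGraw and Wong (1996); the authors' actual computation used irr::icc in R.

```python
def icc_agreement(x):
    """Two-way random-effects agreement ICCs for an n-items x k-raters
    matrix (list of equal-length rows): returns (single, average)."""
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_m = [sum(row) / k for row in x]          # item (row) means
    col_m = [sum(col) / n for col in zip(*x)]    # rater (column) means
    msr = k * sum((m - grand) ** 2 for m in row_m) / (n - 1)  # rows MS
    msc = n * sum((m - grand) ** 2 for m in col_m) / (k - 1)  # columns MS
    sse = sum((x[i][j] - row_m[i] - col_m[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                            # error MS
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average
```

Because averaging over raters cancels rater-specific noise, the average-measures ICC is always at least as high as the single-measures ICC, which is the pattern reported in Table 2.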

Table 2 Interrater reliability for each of the five rating scales

Differences between AMT and undergraduate groups

Due to possible differences in ratings between the undergraduate and AMT groups, we provide these normative data for each group separately (see Supplementary Table 1, https://osf.io/wrqzm/). We first sought to examine whether the groups differed in their normative ratings on each stimulus. To this end, we conducted separate linear mixed-effects models for each variable (valence, arousal, familiarity, age of acquisition, and naming) using the lme4 package in R (Bates, Maechler, Bolker, & Walker, 2015). For each model, we entered group (AMT vs. undergraduate) and stimulus as fixed effects, and included random intercepts for participants. We therefore treated each stimulus as a condition and sought to identify differences between groups for each condition. The car package in R was used to calculate p values for the regression coefficients (Fox & Weisberg, 2011).
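The lme4 models themselves are not reproduced here. As a deliberately simplified stand-in for a per-stimulus group contrast, the sketch below computes Welch's t statistic for one stimulus's ratings in the two groups; this ignores the participant random intercepts of the actual analysis and is illustrative only (Python rather than R, with hypothetical data).

```python
import math

def welch_t(x, y):
    """Welch's (unequal-variance) t statistic for two independent samples,
    e.g., one stimulus's ratings from the AMT and undergraduate groups."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)  # sample variances
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)
```

The mixed-effects approach is preferable in practice because each participant rates all 107 stimuli, so ratings within a participant are not independent.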

Most stimuli did not significantly differ between groups, although this depended on the variable. The two groups tended to agree most on ratings of arousal (only five out of 107 melodies significantly differed between groups) and disagreed most on familiarity (13 out of 107 melodies significantly differed). A few melodies were consistently rated differently across rating scales: for example, the Harry Potter theme (“Hedwig’s Theme”) was rated as more arousing and more familiar, had an earlier age of acquisition, and was named significantly more frequently by the undergraduate group than the AMT group. Overall, arousal and valence ratings tended to be more consistent across groups than familiarity, age of acquisition, or naming. See Supplementary Table 1 for the full list of stimuli with their normative ratings for each group and indications of which ratings differed between groups.

Discussion

The goal of the present work was to create a set of famous musical melodies with normative data on several cognitive and affective dimensions. First, the stimulus set was designed by asking a large online sample of participants to provide names of famous and highly recognizable melodies. The resulting Famous Melodies Stimulus Set contains 107 melodies that were rated by 338 participants. Melodies were rated on their perceived emotional valence, emotional arousal, familiarity, and age of acquisition; in addition, participants provided the name of each melody, and a correct naming percentage was calculated for each. Overall, the familiarity of the stimulus set was validated: most melodies were rated as highly familiar and were typically named correctly. This suggests that the Famous Melodies Stimulus Set contains melodies that are familiar and recognizable to a United States population.

Despite our intended goal of creating a stimulus set of highly recognizable melodies, not all melodies were highly familiar. Cluster analysis revealed a subset of melodies that were rated as less familiar and were less likely to be named. This suggests that the stimulus set contains two sub-groups of melodies: One subgroup contains melodies that are highly familiar, likely to be named, and have an early age of acquisition. The second subgroup contains melodies that are less familiar, less likely to be named, and have a later age of acquisition. It may seem surprising that any of the melodies were rated as highly unfamiliar, since melodies were included precisely because participants suggested them as highly familiar and recognizable. To investigate what might be contributing to this paradoxical unfamiliarity of some of the melodies, we examined whether certain categories of melodies were more likely to fall into the unfamiliar cluster. In doing so, we found that melodies in the “pop” category were overrepresented in this cluster of less familiar melodies.

To interpret why “pop” melodies were rated as unfamiliar, despite being some of the most frequently named melodies in our initial development of the stimulus set (e.g., “Beat It” by Michael Jackson was named by multiple participants), it is important to consider the nature of the melodies in the set. Our stimuli were not naturalistic music; instead, each melody was created as a single-line melody in a piano keyboard timbre. Most listeners are likely not accustomed to hearing popular music in this form. In contrast, children’s tunes and Christmas songs are often sung a cappella, and are arranged in many different ways and performed by many different artists, so listeners may be more familiar with the “pure” melody itself for those categories of melodies. Therefore, the abstracted nature of the “pure” melodies used here may make pop melodies more challenging to identify. One advantage of this variance in familiarity is that researchers can select stimuli with varying degrees of familiarity, based on the goals of their experiment. For example, prior work has typically compared performance between familiar and unfamiliar melodies (Baird & Samson, 2009; Dalla Bella et al., 2003). The stimulus set may also be useful for researchers who wish to compare responses to the pure melody (from the present stimulus set) with the naturalistic music itself (for example, comparing the song “Beat It” with the melody version from the stimulus set).

While some melodies were rated as unfamiliar, the overall distribution of melodies in the familiarity rating space was skewed towards high familiarity. In contrast, responses on the other rating scales were more evenly distributed (see Fig. 1). This has the benefit of allowing researchers to pick stimuli for certain purposes. For example, researchers may want to select stimuli that are equally familiar, but vary in age of acquisition. Another feature of the stimulus set is that researchers can select stimuli that are highly familiar but not “nameable.” For example, “Battle Hymn of the Republic” was rated as highly familiar, but had a low percentage of correct naming. These types of melodies may be useful for research investigating differences between recognition (or familiarity) and identification (or naming) of melodies (Belfi & Tranel, 2014; Kostic & Cleary, 2009). While we identified two clusters of melodies, one containing highly familiar melodies and the other less familiar ones, emotional variables did not differ between the two clusters. This suggests that emotional arousal and valence are distinct from melody familiarity. Therefore, researchers may also select pieces that are matched on valence and arousal, but differ in terms of familiarity or naming ability.

Although there were some melodies rated as highly familiar but with low rates of correct naming, we generally found strong correlations between these variables. That is, melodies that were more familiar were more likely to be named. Familiarity also had a strong inverse relationship with age of acquisition: Melodies that were highly familiar were more likely to be learned at an early age. These correlations were strikingly high, particularly since each participant rated all melodies on only one rating scale (so the ratings were made independently across rating dimensions). This strong relationship between familiarity and age of acquisition is consistent with prior research suggesting that an early age of acquisition facilitates lexical retrieval for item names, and that items learned later in life are more susceptible to loss (Bell, Davies, Hermann, & Walters, 2000; Tzortzis, Goldblum, Dang, Forette, & Boller, 2000).

Despite high correlations among some variables, such as age of acquisition and familiarity, we found mixed interrater reliability. For all variables, interrater reliability was low to moderate for ‘single’ measures. These values are similar to measures of interrater reliability for other musical stimulus sets (Livingstone & Russo, 2018). In contrast, we found high interrater reliability for ‘average’ measures. For both single and average measures, interrater reliability tended to be higher for familiarity and age of acquisition than for emotional valence and arousal. The interrater reliability values found here are consistent with our prior work investigating emotional and aesthetic judgments of music (Belfi, 2019) and poetry (Belfi, Vessel, & Starr, 2017). In both cases, interrater reliability was relatively low for judgments of valence, arousal, and aesthetic appeal. Aesthetic objects, such as music, appear to elicit lower interrater reliability, which may reflect the large role that individual differences play in musical preference (North, 2010; Palmer & Griscom, 2013).

A secondary goal of the present work was to compare our results from an undergraduate sample to a sample of participants from AMT. We found that responses from AMT and undergraduate participants were quite similar, despite differences in demographic variables between the two samples. There were, however, a few notable differences between the two samples: For example, the Harry Potter theme (Hedwig’s Theme) was consistently rated differently between the two groups. The undergraduate sample rated this melody as significantly more arousing, more familiar, and as having a lower age of acquisition than the AMT group. The undergraduate group also named this melody significantly more frequently (91%) than the AMT group (41%). The difference for this particular melody possibly reflects age differences between the undergraduate and AMT samples. With some notable exceptions such as this, there were few melodies that significantly differed between groups on multiple rating scales. We therefore feel that the Famous Melodies Stimulus Set is suitable for use in a range of experimental settings. See Supplementary Table 1 for the full list of stimuli with their normative ratings separated for each group.

The current stimulus set improves on prior work by developing a normed set of famous melodies with ratings on several dimensions. However, this work is not without limitations. First, while we attempted to collect a large number of famous melodies for use in our stimulus set, it was not a completely exhaustive set of all highly familiar melodies. Furthermore, the majority of the melodies here are taken from lyrical songs. Lyrical melodies greatly outnumber non-lyrical melodies in our set (89 with lyrics, 18 without), to the point where comparing data between lyrical and non-lyrical melodies would be difficult. We suspect that our stimulus set does not contain many non-lyrical melodies because participants are often familiar with such melodies, but are unable to name them (such as highly familiar classical music). Since our stimulus selection procedure relied on participants naming melodies (and our stimulus set was created with the goal of containing melodies that could be named), this likely excluded many of these familiar non-lyrical melodies. Therefore, the Famous Melodies Stimulus Set would likely be less appropriate for investigating research questions aimed at comparing lyrical versus non-lyrical music. Additionally, although the Famous Melodies Stimulus Set contains normative ratings of emotional features (valence and arousal), these stimuli would likely be less suitable for use as a purely emotion-inducing stimulus. Hearing these melodies may well induce emotions, but they may do so via autobiographical memories, feelings of nostalgia, or other semantic associations. The present stimulus set also had relatively low interrater reliability on the emotional ratings. Therefore, if a researcher were interested in studying the ‘pure’ effect of music on emotions, it would be more appropriate to use stimulus sets designed specifically to evoke particular emotions (e.g., Lepping et al., 2016).

Although there are certainly types of studies where other musical stimuli would be more suitable, the Famous Melodies Stimulus Set has a wide range of applications for the study of cognitive functions such as memory and language. As has been shown in prior work, these melodies could be used to investigate memory for melodies in dementia (Cuddy & Duffin, 2005; Hsieh et al., 2011) or conceptual and lexical retrieval in lesion patient populations (Belfi et al., 2019; Belfi & Tranel, 2014). These stimuli could be particularly useful in other neuropsychological studies, for example studies examining singing behavior in persons with aphasia (Kasdan & Kiran, 2018). They could also be used to study feelings of familiarity for melodies (Filipic et al., 2010), or memory for musical lyrics. The melodies in the Famous Melodies Stimulus Set all consist of single-line piano MIDI melodies, and could therefore provide an interesting set to compare to more ‘naturalistic’ musical excerpts.

To conclude, famous melodies are frequently used in research in both healthy and clinical populations. The Famous Melodies Stimulus Set has several distinct advantages over prior work: First, the present stimulus set has been validated on a large sample of participants from the United States. As the goal of the present work was to create a set of stimuli that were highly familiar, it was important to create a regionally validated stimulus set. As prior research has shown, musical familiarity is strongly driven by cultural influences, which can affect music perception in many ways (Hannon, Soley, & Ullal, 2012; Teo, Hargreaves, & Lee, 2016). Although prior work has used sets of famous musical stimuli, some of these were normed in other countries, making them less suitable for US-based researchers. In addition to having the distinct advantage of being normed on a US-based sample, these stimuli are also openly available. They are accompanied by normative data from a large number of subjects on several dimensions: emotional valence, emotional arousal, age of acquisition, familiarity, and naming. Unlike other musical stimulus sets, which have typically focused only on emotional features of music, the Famous Melodies Stimulus Set has both emotional ratings as well as ratings typically associated with everyday objects (such as tools, animals, or persons), which will allow for broad usage of the Famous Melodies Stimulus Set. Finally, each melody was rated by a large number of participants, and the ratings are unconfounded, since each rater judged the melodies on only one rating dimension. We hope that the Famous Melodies Stimulus Set will serve as a valuable resource for researchers studying all aspects of music cognition and perception, as well as cognitive functioning more broadly.