AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably

Study 1: distinguishing AI-generated from human-written poems

As specified in our pre-registration (https://osf.io/5j4w9), we predicted that participants would be at chance when trying to identify AI-generated vs. human-written poems, setting the significance level at 0.005 (ref. 18); p-values between 0.05 and 0.005 are treated as "suggestive". Observed accuracy was in fact slightly lower than chance (46.6%, χ2(1, N = 16,340) = 75.13, p < 0.0001). Observed agreement between participants was poor but higher than chance (Fleiss's kappa = 0.005, p < 0.001). The poor agreement suggests that, as expected, participants found the task very difficult and were at least in part answering randomly. However, as in ref. 10, the below-chance performance and the significant agreement between participants led us to conclude that participants were not answering entirely at random; they must be using at least some shared, yet mistaken, heuristics to distinguish AI-generated poems from human-written poems.
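For illustration, the following Python sketch shows how such a chance-level test can be run; the counts are reconstructed from the reported accuracy (46.6% of 16,340 responses) rather than taken from the raw data.

```python
# Illustrative chance-level test: are participants' authorship judgments more
# accurate than a coin flip? Counts are reconstructed from the reported
# accuracy (46.6% of 16,340 responses); they are not the raw data.
from scipy.stats import chisquare

n_trials = 16340
n_correct = round(0.466 * n_trials)          # roughly 7,614 correct judgments
observed = [n_correct, n_trials - n_correct]
expected = [n_trials / 2, n_trials / 2]      # chance performance = 50% correct

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2(1) = {chi2:.2f}, p = {p:.2g}")  # close to the reported 75.13; p < 0.0001
```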

Participants were more likely to judge AI-generated poems to be human-written than they were to judge actual human-written poems to be human-written (χ2(2, N = 16,340) = 247.04, w = 0.123, p < 0.0001). The five poems with the lowest rates of "human" ratings were all written by actual human poets; four of the five poems with the highest rates of "human" ratings were generated by AI.
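The reported effect size can be recovered directly from the test statistic, since Cohen's w = sqrt(χ2 / N); the brief check below uses only the values reported above.

```python
# Worked check of the reported effect size: Cohen's w = sqrt(chi2 / N).
import math

chi2, n = 247.04, 16340
w = math.sqrt(chi2 / n)
print(round(w, 3))  # 0.123, matching the reported value
```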

We used a generalized linear mixed-effects logistic regression (binomial family) to predict participant responses ("written by a human" or "generated by AI"), with poem authorship (human or AI), the identity of the poet, and their interaction as fixed effects. We used sum coding for poet identity so that the main effect of authorship could be interpreted across poets. As specified in our pre-registration, we initially included three random effects: random intercepts for participants (since we took 10 repeated measurements, one per poem, for each participant), random intercepts for poems, and random slopes for the identity of the poet for each poem. Following ref. 19, we used principal component analysis (PCA) to check for overparameterization and determined that the model was indeed overparameterized. PCA indicated that the random intercepts for participants and the random slopes for poet identity were unnecessary and were causing the overparameterization. This conclusion is borne out in the data: the variance across participants in the proportion of "written by a human" responses is only 0.021, and the corresponding variance across poets is only 0.00013. The lower-than-expected variance in the data simply does not support the complex random-effects structure. We therefore fit a reduced model with random intercepts for poems as the only random effect. Using ANOVA to compare model fit, we found that the full model containing our original set of random effects (npar = 76, AIC = 22385, BIC = 22970, logLik = -11116) did not provide a significantly better fit than the reduced model (npar = 21, AIC = 22292.5, BIC = 22454.2, logLik = -11125.2). We therefore proceed with the reduced model.
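As a rough illustration (not the analysis code used for this study), the reduced model could be expressed in Python as follows; the column names (response, authorship, poet, poem) and file name are hypothetical, and statsmodels' Bayesian binomial mixed GLM is used here only as an approximate stand-in for a frequentist GLMM.

```python
# Rough sketch of a comparable reduced model: fixed effects for authorship,
# sum-coded poet identity, and their interaction, with random intercepts for
# poems. Column names and the file layout are hypothetical.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("study1_responses.csv")  # hypothetical layout
# response: 1 = "written by a human", 0 = "generated by AI"

model = BinomialBayesMixedGLM.from_formula(
    "response ~ C(authorship) * C(poet, Sum)",  # sum coding for poet identity
    vc_formulas={"poem": "0 + C(poem)"},        # random intercepts for poems
    data=df,
)
result = model.fit_vb()  # variational Bayes approximation
print(result.summary())
```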

The total explanatory power of the model was low (Conditional R2 = 0.024, Marginal R2 = 0.013), reflecting the expected difficulty of the discrimination task and the fact that, as a result, participants' answers differed only slightly from chance. Consistent with the deviation from chance in overall accuracy, authorship was significantly predictive of participant responses (b = -0.27716, SE = 0.04889, z = -5.669, p < 0.0001): being written by a human poet decreased the likelihood that a participant would respond that the poem was written by a human poet. The odds that a human-written poem is judged human-authored are roughly 75% of the odds that an AI-generated poem is judged human-authored (OR = 0.758). Full results can be found in our supplementary materials.
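The reported odds ratio is simply the exponentiated authorship coefficient; the short check below also constructs an illustrative Wald-style 99.5% interval from the reported coefficient and standard error (this interval is an illustration, not a value reported elsewhere in the paper).

```python
# Converting the reported log-odds coefficient for authorship into an odds
# ratio, with an illustrative 99.5% Wald interval (z = 2.807 for a two-sided
# alpha of 0.005).
import math

b, se = -0.27716, 0.04889
z_crit = 2.807

odds_ratio = math.exp(b)
ci_low, ci_high = math.exp(b - z_crit * se), math.exp(b + z_crit * se)
print(round(odds_ratio, 3), round(ci_low, 3), round(ci_high, 3))
# ~0.758 (roughly 0.66 to 0.87): human authorship lowers the odds of a "human" response
```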

As an exploratory analysis, we refit the model with the addition of several variables reflecting structural features of the stimuli. Following ref. 10, which found that participants use flawed heuristics based on grammar and vocabulary cues to identify AI-generated texts, we examined whether participants look to structural and grammatical features of the poems to determine authorship. To test this, we added to the previous model stimulus word count (scaled), stimulus line count (scaled), stimulus all-lines-rhyme (a binary variable indicating whether or not all lines in the poem end with a rhyme), stimulus quatrain (a binary variable indicating whether the poem was formatted entirely in four-line stanzas, i.e., "quatrains"), and stimulus first person (a variable reflecting whether or not the poem was written in first person, with 3 values: "I" if written in singular first person, "we" if written in plural first person, and "no" if not written in first person).
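To make these structural predictors concrete, the sketch below derives them from raw poem text with simple heuristics; the rhyme and stanza checks are crude spelling- and layout-based proxies used only for illustration, not the procedure used to annotate our stimuli, and the function name is hypothetical. The word and line counts would then be z-scored before entering the model.

```python
# Illustrative derivation of structural predictors from a poem's text.
# The rhyme test is a crude spelling-based proxy for end rhyme, shown only to
# convey the shape of the features; it is not how the stimuli were annotated.
import re

def structural_features(poem: str) -> dict:
    lines = [ln.strip() for ln in poem.splitlines() if ln.strip()]
    stanzas = [s for s in re.split(r"\n\s*\n", poem) if s.strip()]
    last_words = [re.sub(r"[^a-z]", "", ln.split()[-1].lower()) for ln in lines]

    def crude_rhyme(a: str, b: str) -> bool:
        # crude proxy: shared two-letter ending counts as a rhyme
        return len(a) >= 2 and len(b) >= 2 and a[-2:] == b[-2:]

    text = poem.lower()
    first_person = "I" if re.search(r"\bi\b", text) else (
        "we" if re.search(r"\bwe\b", text) else "no")

    return {
        "word_count": len(poem.split()),
        "line_count": len(lines),
        # every line ending rhymes with at least one other line ending
        "all_lines_rhyme": all(
            any(crude_rhyme(w, other)
                for j, other in enumerate(last_words) if j != i)
            for i, w in enumerate(last_words)),
        # every stanza has exactly four non-blank lines
        "quatrain": all(len([l for l in s.splitlines() if l.strip()]) == 4
                        for s in stanzas),
        "first_person": first_person,
    }
```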

As expected, the total explanatory power of the model was low (Conditional R2 = 0.024, Marginal R2 = 0.017). None of the structural features were significantly predictive, but both stimulus line count (b = 0.1461249, SE = 0.0661922, z = 2.208, p = 0.02727) and stimulus all-lines-rhyme (b = 0.2084246, SE = 0.0861658, z = 2.419, p = 0.01557) were suggestive. The effect of authorship (b = -0.1852979, SE = 0.0914278, z = -2.027, p = 0.04269) also appears to be somewhat weakened by the structural features; controlling for them, the estimated odds of a human-authored poem being judged human-authored are roughly 83% of the odds of an AI-generated poem (OR = 0.831). This suggests that participants are using some shared heuristics to discriminate AI-generated poems from human-written poems; they may take AI to be less able to form rhymes and less able to produce longer poems. If so, these heuristics are flawed: in our dataset, AI-generated poems are in fact more likely to rhyme at all lines (89% of our AI-generated poems do, versus only 40% of our human-written poems), and there is no significant difference in the average number of lines between AI-generated and human-written poems.

The effect of experience with poetry

We asked participants several questions to gauge their experience with poetry, including how much they like poetry, how frequently they read poetry, and their level of familiarity with their assigned poet. Overall, our participants reported a low level of experience with poetry: 90.4% of participants reported that they read poetry a few times per year or less, 55.8% described themselves as "not very familiar with poetry", and 66.8% described themselves as "not familiar at all" with their assigned poet. Full details of the participant responses to these questions can be found in Table S1 in our supplementary materials.

In order to determine if experience with poetry improves discrimination accuracy, we ran an exploratory model using variables for participants’ answers to our poetry background and demographics questions. We included self-reported confidence, familiarity with the assigned poet, background in poetry, frequency of reading poetry, how much participants like poetry, whether or not they had ever taken a poetry course, age, gender, education level, and whether or not they had seen any of the poems before. Confidence was scaled, and we treated poet familiarity, poetry background, read frequency, liking poetry, and education level as ordered factors. We used this model to predict not whether participants answered “AI” or “human,” but whether participants answered the question correctly (e.g., answered “generated by AI” when the poem was actually generated by AI). As specified in our pre-registration, we predicted that participant expertise or familiarity with poetry would make no difference in discrimination performance. This was largely confirmed; the explanatory power of the model was low (McFadden’s R2 = 0.012), and none of the effects measuring poetry experience had a significant positive effect on accuracy. Confidence had a small but significant negative effect (b = -0.021673, SE = 0.003986, z = -5.437, p < 0.0001), indicating that participants were slightly more likely to guess incorrectly when they were more confident in their answer.
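For illustration, the sketch below shows one way such predictors could be encoded in Python, standardizing confidence and giving an experience variable an ordered (polynomial-contrast) coding; the column names, file name, and response levels are hypothetical, and this is not the analysis code used for the study.

```python
# Illustrative encoding of the accuracy model's predictors: confidence is
# standardized and an experience item gets ordered (polynomial) contrasts.
# Column names, file layout, and level labels are hypothetical.
import pandas as pd
import patsy

df = pd.read_csv("study1_responses.csv")  # hypothetical layout

levels = ["none", "low", "medium", "high"]  # hypothetical ordering of one experience item
df["poet_familiarity"] = pd.Categorical(df["poet_familiarity"],
                                        categories=levels, ordered=True)

# standardize() z-scores confidence; C(..., Poly) applies orthogonal polynomial
# contrasts, patsy's analogue of R's ordered-factor coding
y, X = patsy.dmatrices(
    "correct ~ standardize(confidence) + C(poet_familiarity, Poly) + C(gender)",
    data=df, return_type="dataframe",
)
```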

We find two positive effects on discrimination accuracy: gender, specifically "non-binary/third gender" (b = 0.169080, SE = 0.030607, z = 5.524, p < 0.0001), and having seen any of the poems before (b = 0.060356, SE = 0.016726, z = 3.608, p = 0.000309). These effects are very small; having seen poems before only increases the odds of a correct answer by 6% (OR = 1.062). These findings suggest that experience with poetry did not improve discrimination performance unless that experience allowed participants to recognize the specific poems used in the study. In summary, Study 1 showed that human-out-of-the-loop AI-generated poetry is judged to be human-written more often than poetry written by actual human poets, and that experience with poetry does not improve discrimination performance. Our results contrast with those of previous studies, in which participants were able to distinguish the poems of professional poets from human-out-of-the-loop AI-generated poetry (ref. 16), or were at chance in distinguishing human poetry from human-out-of-the-loop AI-generated poetry (ref. 17). Past research has suggested that AI-generated poetry needs human intervention to seem human-written to non-expert participants, but recent advances in LLMs have achieved a new state of the art in human-out-of-the-loop AI poetry that now, to our participants, seems "more human than human."

Study 2: evaluating AI-generated and human-generated poems

Our second study asked participants to rate each poem's overall quality, rhythm, imagery, and sound; the extent to which the poem was moving, profound, witty, lyrical, inspiring, beautiful, meaningful, and original; how well the poem conveyed a specific theme; and how well it conveyed a specific mood or emotion. Each of these was reported on a 7-point Likert scale. In addition to these 14 qualitative assessments (which were selected by examining rules for "poetry explication"; see, e.g., ref. 20), participants also answered whether the poem rhymed, with the choices "no, not at all," "yes, but badly," and "yes, it rhymes well."

As specified in our pre-registration (https://osf.io/82h3m), we predicted (1) that participants’ assessments would be more positive when told the poem is human-written than when told the poem is AI-generated, and (2) that a poem’s actual authorship (human or AI) would make no difference in participants’ assessments. We also predicted that expertise in poetry (as measured by the self-reported experience with poetry) would make no difference in assessments.

Fig. 1: Ratings for the 14 Measures of Poetic Excellence.

Ratings of overall quality of the poems are lower when participants are told the poem is generated by AI than when told the poem is written by a human poet (two-sided Welch's t(4571.552) = –17.398, p < 0.0001, pBonf < 0.0001, mean difference = –0.814, Cohen's d = -0.508, 99.5% CI –0.945 to –0.683), confirming earlier findings that participants are biased against AI authorship (refs. 2, 7, 15). However, contrary to earlier work (refs. 14, 16, 17), we find that ratings of overall quality are higher for AI-generated poems than for human-written poems (two-sided Welch's t(6618.345) = 27.991, p < 0.0001, pBonf < 0.0001, mean difference = 1.045, Cohen's d = 0.671, 99.5% CI 0.941 to 1.150); Fig. 1 compares the rating distributions for AI-generated poems and human-written poems. The same pattern (ratings significantly lower when participants are told a poem is AI-generated, but significantly higher when the poem actually is AI-generated) holds for 13 of our 14 qualitative ratings. The exception is "original": poems are rated as less original when participants are told the poem is generated by AI than when told it is written by a human (two-sided Welch's t(4654.412) = -16.333, p < 0.0001, pBonf < 0.0001, mean difference = -0.699, Cohen's d = -0.478, 99.5% CI –0.819 to –0.579), but originality ratings for actually AI-generated poems are not significantly higher than for actually human-written poems (two-sided Welch's t(6957.818) = 1.654, p = 0.098, pBonf = 1.000, mean difference = 0.059, Cohen's d = 0.040, 99.5% CI –0.041 to 0.160). The largest effect is on "rhythm": AI-generated poems are rated as having much better rhythm than the poems written by famous poets (two-sided Welch's t(6694.647) = 35.319, p < 0.0001, pBonf < 0.0001, mean difference = 1.168, Cohen's d = 0.847, 99.5% CI 1.075 to 1.260). The preference is remarkably consistent: as seen in Fig. 2, all 5 AI-generated poems are rated more highly in overall quality than all 5 human-authored poems.
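Each of these comparisons is an independent-samples Welch test, Bonferroni-corrected across the 14 rating dimensions; the sketch below shows the form of a single comparison on placeholder rating vectors (the pooled-SD Cohen's d shown is one common formulation).

```python
# Sketch of one quality comparison: Welch's t-test, a pooled-SD Cohen's d, and
# a Bonferroni adjustment over the 14 rating dimensions. The rating vectors are
# random placeholders, not study data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
ai_ratings = rng.integers(1, 8, size=300).astype(float)     # placeholder 1-7 Likert ratings
human_ratings = rng.integers(1, 8, size=300).astype(float)  # placeholder 1-7 Likert ratings

t, p = ttest_ind(ai_ratings, human_ratings, equal_var=False)  # Welch's t-test
p_bonf = min(p * 14, 1.0)                                     # Bonferroni across 14 ratings

n1, n2 = len(ai_ratings), len(human_ratings)
pooled_sd = np.sqrt(((n1 - 1) * ai_ratings.var(ddof=1) +
                     (n2 - 1) * human_ratings.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (ai_ratings.mean() - human_ratings.mean()) / pooled_sd
print(f"t = {t:.3f}, p_bonf = {p_bonf:.3g}, d = {cohens_d:.3f}")
```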

Fig. 2: Overall Quality Ratings for Study 2 Poems. Error bars correspond to 99.5% confidence intervals; the vertical blue line marks the mean rating across all poems and participants (4.7).

We used a linear mixed effects model to predict the Likert scale ratings for each of our 14 qualitative dimensions. We used poem authorship (human or AI), framing condition (told human, told AI, or told nothing), and their interaction as fixed effects. As specified in our pre-registration, we initially planned to include four random effects: random intercepts per participant, random slopes of poem authorship per participant, random intercepts per poem, and random slopes of framing condition per poem. As in Study 1, we followed ref. 19 in checking the models for overparameterization; PCA dimensionality reduction revealed that the models were overparameterized, specifically because of the random slopes for framing condition per poem. An attempt to fit a zero-correlation-parameter model did not prevent overparameterization; we therefore fit a reduced model for each DV without the random slopes for framing condition. ANOVA comparisons between the full and reduced models for each DV found that the reduced model provided at least as good a fit for 12 of the 14 DVs (all except "original" and "witty"). We therefore proceed with the reduced models.
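A rough Python analogue of a reduced rating model is sketched below (this is not the analysis code used for the study): the data are treated as a single group so that crossed random intercepts for participants and poems can be expressed as variance components, and the random slope of authorship per participant is omitted for brevity; column names are hypothetical.

```python
# Rough sketch of a reduced rating model in statsmodels: fixed effects for
# authorship, framing, and their interaction, with crossed random intercepts
# for participants and poems expressed as variance components under a single
# all-encompassing group. Column names and file layout are hypothetical, and
# the random slope of authorship per participant is omitted here.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study2_ratings.csv")  # hypothetical layout
df["whole_sample"] = 1                  # single group enables crossed effects

model = smf.mixedlm(
    "rating ~ C(authorship) * C(framing)",
    data=df,
    groups="whole_sample",
    re_formula="0",  # no random effect for the dummy group itself
    vc_formula={"participant": "0 + C(participant)", "poem": "0 + C(poem)"},
)
fit = model.fit(reml=True)
print(fit.summary())
```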

For 9 of our 14 qualities, human authorship had a significant negative effect (p < 0.005), with poems written by human poets rated lower than poems generated by AI; for 4 qualities the effect was negative but merely suggestive (0.005 < p < 0.05). The only quality for which there is not even a suggestive negative authorship effect is "original" (b = -0.16087, SE = 0.10183, df = 29.01975, t = -1.580, p = 0.1250). For 12 of our 14 qualities, the "told human" framing condition had a significant positive effect: poems are rated more highly when participants are told that the poem is written by a human poet. For "inspiring" (b = 0.21902, SE = 0.11061, df = 693.00000, t = 1.980, p = 0.04808) and "witty" (b = 0.28140, SE = 0.12329, df = 693.00024, t = 2.282, p = 0.02277) the effect is merely suggestive. For all 14 models, the explanatory power is substantial (conditional R-squared > 0.47). Detailed analyses for all qualities can be found in our supplementary materials.

Factor analysis of qualitative ratings

As specified in our pre-registration, we planned to factor analyze responses to the following scales: moving, profound, witty, lyrical, inspiring, beautiful, meaningful, and original. However, we found higher-than-expected correlations among all of our qualitative ratings; polychoric correlations ranged from 0.472 to 0.886, with a mean of 0.77. We therefore performed factor analysis on all 14 qualitative ratings. Parallel analysis suggested 4 factors. We performed a maximum likelihood factor analysis with an oblique rotation; factor scores were estimated using the ten Berge method (ref. 21).
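The sketch below gives an approximate Python version of this step using the factor_analyzer package; it relies on Pearson rather than polychoric correlations and on the package's default regression-based factor scores rather than ten Berge scores, so it only approximates the reported analysis, and the column names are hypothetical.

```python
# Approximate sketch of the factor analysis step: maximum likelihood
# extraction, 4 factors, oblique (oblimin) rotation. Unlike the reported
# analysis, this uses Pearson correlations and the package's default
# regression-based factor scores. Column names and file layout are hypothetical.
import pandas as pd
from factor_analyzer import FactorAnalyzer

ratings = pd.read_csv("study2_ratings.csv")[[
    "quality", "rhythm", "imagery", "sound", "moving", "profound", "witty",
    "lyrical", "inspiring", "beautiful", "meaningful", "original",
    "theme", "mood_emotion",
]]

fa = FactorAnalyzer(n_factors=4, rotation="oblimin", method="ml")
fa.fit(ratings)
loadings = pd.DataFrame(fa.loadings_, index=ratings.columns)
scores = fa.transform(ratings)  # per-response factor scores
print(loadings.round(2))
```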

Fig. 3: Factor Loadings for each Qualitative Dimension.

Factor 1 is most heavily weighted towards "beautiful," "inspiring," "meaningful," "moving," and "profound"; we take it to correspond to the poem's emotional quality and call it "Emotional Quality." Factor 2 is most heavily weighted towards "rhythm," "lyrical," and "sound"; we take it to reflect the poem's formal quality, including its structural and metrical features, and call it "Formal Quality." Factor 3 is most heavily weighted towards "imagery," "mood or emotion," and "theme"; we take it to reflect the poem's ability to capture a particular poetic atmosphere and call it "Atmosphere." Factor 4 is most heavily weighted toward "witty" and "original"; we take it to reflect how creative or unique the poem is and call it "Creativity." Fig. 3 shows the factor loadings for each qualitative dimension.

Fig. 4: Scores for the Four Factors for AI-Generated and Human-Written Poems.

For each of the four factors, we used a linear mixed effects regression to predict factor scores for each participant's rating of each poem, using the same fixed and random effects used for the 14 qualitative dimension DVs. We again found that the preregistered random effects overparameterized the models, and used the reduced models with no random slopes for framing condition.

Fig. 5: Factor Scores for each Framing Condition.

We find that across all four factors, the explanatory power of the models is substantial (conditional R-squared > 0.5). The “told human” framing condition has a significant positive effect on all factors, and human authorship has a significant negative effect on 3 of the 4 factors. Figure 4 shows factor scores for human and AI authorship; Fig. 5 shows factor scores for each framing condition; the results for each of the 4 factor-prediction models, with the results for overall quality for comparison, can be found in Table 1.

Table 1: Regression Coefficients with 99.5% Confidence Intervals for the Four Factor and Overall Quality Ratings Linear Mixed-Effects Regression Models.

Using qualitative ratings to predict discrimination

As in Study 1, we also used a mixed effects logistic regression (binomial family) to predict participant responses to the discrimination question ("written by a human" or "generated by AI") for participants in the "told nothing" framing condition. We included authorship (human or AI), stimulus line count (scaled), stimulus all-lines-rhyme, and stimulus first-person as fixed effects, with random intercepts for participants (dropping stimulus word count and stimulus quatrain from the model we used in Study 1 due to high multicollinearity in Study 2's smaller set of 10 poems). As expected, the explanatory power of the model was low (conditional R-squared: 0.071, marginal R-squared: 0.013), but as in Study 1, stimulus authorship (b = -0.435689, SE = 0.125832, z = -3.462, p = 0.000535) was once again significantly predictive of participants' responses: being written by a human poet decreased the likelihood that a participant would respond that the poem was written by a human poet, with the odds of a human-authored poem being judged human-authored less than two-thirds the odds of an AI-generated poem (OR = 0.647). This finding replicates the main result of our first study.

As an exploratory analysis, we also fit a model with our four factors: Emotional Quality, Formal Quality, Atmosphere, and Creativity. We included authorship and these four factors as fixed effects, with random intercepts for participants. Effectively, this model replaces the structural features of the previous model (stimulus line count, stimulus all-lines-rhyme, and stimulus first-person) with qualitative features. The explanatory power of this model was higher (conditional R-squared: 0.240, marginal R-squared: 0.148), suggesting that qualitative features may have more influence than structural features on participants' beliefs about a poem's authorship. Atmosphere (b = 0.55978, SE = 0.11417, z = 4.903, p < 0.0001) was significantly predictive: higher scores for Atmosphere increased the likelihood that a participant predicted the poem was written by a human. We also found suggestive positive effects for Emotional Quality (b = 0.22748, SE = 0.11402, z = 1.995, p = 0.04604) and Creativity (b = 0.18650, SE = 0.07322, z = 2.547, p = 0.01087), suggesting that higher scores for Emotional Quality and Creativity may also increase the likelihood that participants predict a poem was written by a human poet. Importantly, in this model, unlike the previous discrimination models, authorship no longer has a negative effect; its coefficient is positive and non-significant (b = 0.23742, SE = 0.14147, z = 1.678, p = 0.09332). This suggests that the "more human than human" phenomenon identified in Study 1 might be driven by participants' more positive impressions of AI-generated poems compared to poems authored by human poets; when we account for these qualitative judgments, the phenomenon disappears.

In summary, Study 2 finds that participants consistently rate AI-generated poetry more highly than the poetry of well-known human poets across a variety of factors. Regardless of a poem’s actual authorship, participants consistently rate poems more highly when told that a poem is written by a human poet, as compared to being told that a poem was generated by AI. The preference for AI-generated poetry at least partially explains the “more human than human” phenomenon found in Study 1: when controlling for participants’ ratings, AI-generated poems are no longer more likely to be judged human.
