Decoded: A behind-the-scenes blog about research methods at Pew Research Center (https://www.pewresearch.org/decoded/)

How adding a ‘Don’t know’ response option can affect cross-national survey results
Sept. 6, 2023 | https://www.pewresearch.org/decoded/2023/09/06/how-adding-a-dont-know-response-option-can-affect-cross-national-survey-results/

In our surveys, people are much less likely to skip questions online than when speaking to interviewers in person or on the phone; we explore how offering a “Don’t know” option in online surveys affects results.


Pew Research Center’s international surveys have typically not given respondents the explicit option to say that they don’t know the answer to a particular question. In surveys conducted face-to-face or by phone, we’ve instead allowed respondents to voluntarily skip questions as they see fit. Similarly, in self-administered surveys conducted online, respondents can skip questions by selecting “Next page” and moving on. But the results of many survey experiments show that people are much less likely to skip questions online than when speaking to interviewers in person or on the phone.

This skipping effect raises a question for cross-national surveys that use different modes of interviewing people: Would it be better to give respondents an explicit “Don’t know” option in online surveys for potentially low-salience questions even if the same is not done in face-to-face or phone surveys?

To explore this question, we conducted multiple split-form experiments on nationally representative online survey panels in the United States and Australia.1 In both countries, we asked people about international political leaders and gave half of the sample the option to choose “Never heard of this person” while withholding this option for the other half of respondents.  

Similarly, we asked questions in Australia assessing various elements of Chinese soft power such as the country’s universities, military, standard of living, technological achievements and entertainment. We offered “Not sure” as a response option to half the sample, while the other half was not shown this option.

In the sections that follow, we’ll evaluate the impact of the “Never heard of this person” and “Not sure” options (which we’ll jointly refer to as “Don’t know” options) and assess whether adding such options can improve comparability across survey modes.

How does adding a “Don’t know” option affect views of world leaders?

In both the U.S. and Australia, giving respondents the option to say they had never heard of certain world leaders resulted in more of them choosing that option. Specifically, we tend to see a shift of topline results from the more moderate response options – “Some confidence” and “Not too much confidence” – to the “Don’t know” option. This was especially the case for leaders who received the largest shares of “Don’t know” responses such as French President Emmanuel Macron, German Chancellor Olaf Scholz and Indian Prime Minister Narendra Modi.

For example, in the U.S., when respondents were asked about their confidence in Modi without a “Don’t know” option, 34% said they had some confidence in the Indian leader and 42% said they had not too much confidence in him. But when we offered respondents the option to say they had never heard of Modi, the share saying they had some confidence in him dropped 14 percentage points and the share saying they had not too much confidence in him dropped 17 points. On the other hand, the shares of respondents who chose either of the two more forceful response options – a lot of confidence in Modi or no confidence at all in him – were comparable whether the “Don’t know” option was presented or not.
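As a rough illustration of how a split-form comparison like this can be tabulated, the sketch below computes weighted response distributions separately for each experimental condition. The data frame and column names (resp, form, confidence_modi, weight) are hypothetical, not the Center’s actual variables.

library(tidyverse)

# Weighted distribution of responses to the Modi confidence item,
# tabulated separately for the form with and without the "Don't know" option
resp %>%
  group_by(form, confidence_modi) %>%
  summarize(wtd_n = sum(weight), .groups = "drop_last") %>%
  mutate(share = 100 * wtd_n / sum(wtd_n)) %>%
  ungroup() %>%
  select(-wtd_n) %>%
  pivot_wider(names_from = form, values_from = share)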

A table showing Americans’ confidence in world leaders, with and without a ‘Don’t know’ option

While Australians were generally less likely than Americans to say they had never heard of the leaders in question, the response patterns were broadly similar between the two countries. For these lesser-known leaders, the share of respondents who simply refused to answer the question also decreased when respondents were able to say they did not know the leader.

The experiment in the U.S. and Australia allowed us to observe other patterns. For example, in both countries, the groups most likely to choose the “Don’t know” answer option included women, people without a college degree and the youngest respondents.

How does adding a “Don’t know” option affect views of China’s soft power?

To assess how a “Don’t know” option might affect a different type of question, we ran the same experiment on a battery of questions in Australia that asked respondents to rate China’s universities, standard of living, military, technological achievements and entertainment (including movies, music and television). Respondents could rate each of these aspects of Chinese soft power as the best, above average, average, below average or the worst compared with other wealthy nations.

Here, too, adding a “Don’t know” option resulted in a shift away from the middle answer options and toward the “Don’t know” option. For example, when asked about Chinese universities without the “Don’t know” option, 47% of Australians rated them as average. But when we included the “Don’t know” option, the share of Australians saying this dropped to 33%. Similarly, the share of Australians who rated Chinese entertainment as average fell from 40% without the “Don’t know” option to 27% with it. There were few notable differences in the other response categories.

How does adding a “Don’t know” option change how we interpret cross-national findings?

We almost always analyze our international survey data in the context of how countries compare with one another. For example, we’d like to know if people in a certain country are the most or least likely to have confidence in a given leader. However, as shown above, including an explicit “Don’t know” option can meaningfully change our results and make it challenging to compare countries against one another in this way.

For the questions used in this experiment, adding a “Don’t know” option generally did not result in major shifts in terms of how countries compare with one another. The main exception was Americans’ confidence in Germany’s leader, Scholz.

Yet even in his case, the inclusion of the “Don’t know” option did not substantially change the major conclusions we might draw from cross-national comparisons: Americans’ confidence in Scholz is relatively low, especially compared with the broad confidence he inspires in the Netherlands, Sweden and Germany itself.

A bar chart showing how adding a "don't know" option affects cross-national comparisons regarding confidence in German Chancellor Olaf Scholz

We see a few potential reasons why the effect on the overall pattern of results is limited. For one, there is a wide range of opinions across the countries surveyed: The share expressing confidence in Scholz ranges from 76% in the Netherlands to just 16% in Argentina. The 11-point difference between responses with and without a “Don’t know” option in the U.S. is therefore relatively small compared with the overall range across the other 23 countries surveyed.

Also, including the “Don’t know” response did not seem to affect just positive or negative responses. Instead, there was similar attrition from the “Some confidence” and “Not too much confidence” categories into the “Don’t know” category.

Conclusion

We believe including an explicit “Don’t know” option allows respondents to reflect their opinions more precisely in online surveys. While there may be significant effects on our topline survey results, our experiment does not indicate that adding a “Don’t know” option would greatly change the conclusions we can draw from cross-national analysis. Nor does it show that demographic groups treated the question formats differently enough to alter those conclusions. Therefore, we’re inclined to offer a “Don’t know” option on future online surveys that are part of our cross-national survey research efforts.

While previous research shows that respondents are more likely to refuse to answer questions in face-to-face and phone surveys than in online surveys, we do see potential for further inquiry as to how phone and face-to-face respondents would treat an explicit “Don’t know” on low-salience questions.


  1. While the U.S. panel is fully online, the Australia panel is mixed-mode, and a small percentage (3%) of respondents elected to take the survey over the phone. They too were presented with an explicit “Don’t know” option in one condition and their responses are included in this analysis. ↩︎

What Twitter users say versus what they do: Comparing survey responses with observed behaviors
June 28, 2023 | https://www.pewresearch.org/decoded/2023/06/28/what-twitter-users-say-versus-what-they-do-comparing-survey-responses-with-observed-behaviors/

We explore the connection between Americans’ survey responses and their digital activity using data from our past Twitter research.


Survey researchers often ask respondents about things they may have done in the past: Have they attended a religious service in the past year? When was the last time they used cash to pay for something? Did they vote in the most recent presidential election? But questions like these can put respondents’ memories to the test.

In many instances, we have no option but to ask these questions directly and rely on strategies to help with respondents’ recollection. But increasingly, we’re able to find out exactly what someone did instead of asking them to remember it. In the case of social media, for example, we can ask respondents how often they post on a site like Twitter. But we can also simply look up the answer if they have given us permission to do so.

Since space in a survey questionnaire is a precious resource, we generally don’t like to ask respondents about things we can easily find ourselves. But there are times when it’s useful to compare what respondents say in a survey with what we can discover by looking at their online activity. Having both pieces of information – what people say they did and what they actually did – can help us better understand things like respondent recall or how respondents and researchers think about and define the same activities.

This post explores the connection between Americans’ survey responses and their digital activity using data from our past research about Twitter. First, we’ll look at how well U.S. adults on Twitter can remember the number of accounts they follow (or that follow them) on the platform. Then we’ll compare survey responses and behavioral measures related to whether respondents post about politics on Twitter. This, in turn, can help us learn more about how respondents and researchers interpret and operationalize the concept of “political content.”

All findings in this analysis are based on a May 2021 survey of U.S. adults who use Twitter and agreed to share their Twitter handle with us for research purposes.

Followings and followers

In our May 2021 survey, we asked Twitter users to tell us how many accounts they follow on the site, as well as how many accounts follow them. We did not expect that they would know these figures off the top of their heads, and worried that asking for a precise number could prompt respondents to go to their Twitter accounts and look up the answers. So, we gave them response options in the form of ranges: fewer than 20 accounts, from 20 to 99, from 100 to 499, from 500 to 999, from 1,000 to 4,999 and more than 5,000 accounts. (We accidentally excluded the number 5,000 from the ranges listed in our survey instrument. Luckily for us, none of our respondents had exactly 5,000 followers, so this oversight did not come into play in any of the calculations that follow.)

After asking these questions, we used the Twitter API to look up each respondent’s profile on the site. We then performed a fairly simple calculation to see whether the range each respondent chose in the survey matched the number of followed accounts we found on their profile.
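One simple way to make that comparison is to bin the follower counts retrieved from the API into the same ranges offered in the questionnaire and check whether each respondent’s chosen range matches. The sketch below uses hypothetical column names (reported_following_range, api_following_count) rather than our actual variables.

library(tidyverse)

twitter_users <- twitter_users %>%
  mutate(observed_range = cut(api_following_count,
                              breaks = c(-Inf, 19, 99, 499, 999, 4999, Inf),
                              labels = c("Fewer than 20", "20 to 99", "100 to 499",
                                         "500 to 999", "1,000 to 4,999", "More than 5,000")),
         range_correct = as.character(observed_range) == as.character(reported_following_range))

# Share of respondents whose survey answer matched the observed count
mean(twitter_users$range_correct, na.rm = TRUE)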

A bar chart showing that frequent Twitter users are more likely to remember how many accounts they follow or how many follow them

More than half of Twitter users who provided their handle (59%) chose a range that accurately reflected the number of accounts they follow. And two-thirds (67%) were able to accurately recall the number of accounts that follow theirs. So it seems that Twitter users have slightly more awareness of their own reach on the platform than of the number of accounts they’ve chosen to follow.

We also found that users were better at answering these questions the more often they used Twitter. For instance, 69% of those who said they use Twitter every day were able to accurately report the range of accounts they follow. But that share fell to 49% among those who said they used it a few times a month or less often.

Frequency of political tweeting

Of course, asking people how many accounts they follow on Twitter is a concrete question with a definitive answer. Many other questions are more abstract.

Consider our 2022 study about politics on Twitter, in which we wanted to measure how often Twitter users talk about politics on the site based on the content of their tweets. Most people can probably agree that “talking about politics” includes someone expressing explicit support for a candidate for office or commenting on a new bill being debated in Congress. But what about someone expressing an opinion about a social movement or sharing an opinion piece about tax rates or inflation? Defining which tweets to count as “political” is not easy.

In earlier research on political tweeting, we used a fairly narrow definition in which “political tweets” were those mentioning common political actors or a formal political behavior like voting. This definition makes it easy to identify relevant tweets, but it also leaves out many valuable forms of political speech. In our more recent study, we wanted to use a broader definition that was more in line with what people understand as political content. For this, we fine-tuned a machine learning model that was pretrained on a large corpus of English text data, and we validated it using 1,082 tweets hand-labeled by three annotators who were given a general prompt to identify any tweets containing “political content.” (The complete details can be found in the methodology of the 2022 report.)

With that automatic classification of all our respondents’ tweets in hand, we wanted to know whether we were capturing tweets that they themselves would classify as political content. So we included a question in our survey asking respondents if they had tweeted or retweeted about a political or social issue in the last year; 45% said that they had done this. We then compared their responses to our classification of their tweets.

As it turns out, a large majority (78%) of those who said that they had tweeted about politics in the last year had posted at least one tweet classified as political content by our machine learning model. And 67% of these users had posted at least five such tweets in that time. This tells us that the definition operationalized in our machine learning model was catching a large percentage of the respondents who said that they engaged in the behavior we were studying.

However, this is only part of the story. When we looked at people who said they didn’t tweet or retweet about politics in the last year, we found that 45% of them did, in fact, post at least one tweet in the last year that our model classified as political – and 24% of them had posted five or more such tweets during that span. These results were similar if we changed the time horizon, using a separate question asking if users had ever tweeted or retweeted about political or social issues or if they had done so in the last 30 days.

In other words, either our definition of political content was more encompassing than the one used by some of our survey respondents, or some respondents simply didn’t recall that they had posted political content.
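A minimal sketch of the comparison described above, assuming a respondent-level data frame with hypothetical columns said_political_tweet (the survey self-report) and n_political_tweets (the number of that respondent’s tweets our model classified as political):

library(tidyverse)

twitter_users %>%
  group_by(said_political_tweet) %>%
  summarize(pct_at_least_one = 100 * mean(n_political_tweets >= 1),
            pct_at_least_five = 100 * mean(n_political_tweets >= 5),
            n = n())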

Conclusions

This exercise allows us to see how well U.S. Twitter users’ survey responses match up with their observed behaviors on the site. Sometimes the two measures align fairly well, as in the case of follower counts. In other situations, however, making a comparison is harder. Some questions, such as whether tweets are political or not, depend on the judgment of the respondents, and our operational definition may be different from theirs. That discrepancy is not always avoidable. Using a single definition to measure Twitter behavior among a large group of respondents means that some respondents may not recognize certain behaviors the same way we do.

Assessing the effects of generation using age-period-cohort analysis
May 22, 2023 | https://www.pewresearch.org/decoded/2023/05/22/assessing-the-effects-of-generation-using-age-period-cohort-analysis/

In this piece, we demonstrate how to conduct age-period-cohort analysis, a statistical tool for determining the effects of generation.


(Related posts: 5 things to keep in mind when you hear about Gen Z, Millennials, Boomers and other generations and How Pew Research Center will report on generations moving forward)

Opinions often differ by generation in the United States. For example, Gen Zers and Millennials are more likely than older generations to want the government to do more to solve problems, according to a January 2020 Pew Research Center survey. But will Gen Zers and Millennials always feel that way, or might their views on government become more conservative as they age? In other words, are their attitudes an enduring trait specific to their generation, or do they simply reflect a stage in life?

That question cannot be answered with a single survey. Instead, researchers need two things: 1) survey data collected over many years – ideally at least 50 years, or long enough for multiple generations to advance through the same life stages; and 2) a statistical tool called age-period-cohort (APC) analysis.

In this piece, we’ll demonstrate how to conduct age-period-cohort analysis to determine the effects of generation, using nearly 60 years of data from the U.S. Census Bureau’s Current Population Survey. Specifically, we’ll revisit two previous Center analyses that looked at generational differences in marriage rates and the likelihood of having moved residences in the past year to see how they hold up when we use APC analysis.

What is age-period-cohort analysis?

In a typical survey wave, respondents’ generation and age are perfectly correlated with each other. The two cannot be disentangled. Separating the influence of generation and life cycle requires us to have many years of data.

If we have data not just from 2020, but also from 2000, 1980 and earlier decades, we can compare the attitudes of different generations while they were passing through the same stages in the life cycle. For example, we can contrast a 25-year-old respondent in 2020 (who would be a Millennial) with a 25-year-old in 2000 (a Gen Xer) and a 25-year-old in 1980 (a Baby Boomer).

A dataset compiled over many years allows age and generation to be examined separately using age-period-cohort analysis. In this context, “age” denotes a person’s stage in the life cycle, “period” refers to when the data was collected, and “cohort” refers to a group of people who were born within the same time period. For this analysis, the cohorts we are interested in are generations – Generation Z, Millennials, Generation X and so on.

APC analysis seeks to parse the effects of age, period and cohort on a phenomenon. In a strictly mathematical sense, this is a problem because any one of those things can be exactly calculated from the other two (e.g., if someone was 50 years old in 2016, then we also know they were born circa 1966 because 2016 – 50 = 1966). If we only have data collected at one point in time, we cannot conclusively determine whether apparent differences between generations are anything more than a reflection of how old the respondents were when the data was collected. For example, we don’t know whether Millennials’ lower marriage rate in 2016 is a generational difference or simply a result of the fact that people ages 23 to 38 are less likely to be married, regardless of when they were born. Even if we have data from many years, any apparent trends could also be due to other factors that apply to all generations and age groups equally. We might mistakenly attribute a finding to age or cohort when it really should be attributed to period. This is called the “identification problem.” Approaches to APC analysis are all about getting around the identification problem in some way.
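The identification problem is easy to see in code: if age, period and cohort all enter a regression as numeric predictors, the three are perfectly collinear and one effect cannot be estimated. Here is a toy illustration (not part of the analysis that follows):

# Toy data: birth year (cohort) is fully determined by survey year (period) and age
set.seed(1)
toy <- data.frame(period = sample(1962:2021, 500, replace = TRUE),
                  age = sample(18:80, 500, replace = TRUE))
toy$cohort <- toy$period - toy$age
toy$y <- rnorm(500)  # an arbitrary outcome

# lm() detects the perfect collinearity and reports NA for one of the three coefficients
lm(y ~ age + period + cohort, data = toy)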

Using multilevel modeling for age-period-cohort analysis

A common approach for APC analysis is multilevel modeling. A multilevel model is a kind of regression model that can be fit to data that is structured in “groups,” which themselves can be units of analysis.

A classic example of multilevel modeling is in education research, where a researcher might have data on students from many schools. In this example, the two levels in the data are students and the schools they attend. Traits on both the student level (e.g., grade-point average, test scores) and the school level (e.g., funding, class size) could be important to understanding outcomes.

In APC analysis, things are a little different. In a dataset collected over many years, we can think of each respondent as belonging to two different but overlapping groups. The first is their generation, as determined by the year in which they were born. The second is the year in which the data was collected.

Fitting a multilevel model with groups for generation and year lets us isolate differences between cohorts (generations) and periods (years) while holding individual characteristics like age, sex and race constant. Placing period and cohort on a different level from age addresses the identification problem by allowing us to model all of these variables simultaneously.

Conducting age-period-cohort analysis with the Current Population Survey

One excellent resource that can support APC analysis is the Current Population Survey’s Annual Social and Economic Supplement (ASEC), conducted almost every year from 1962 to 2021. We’ll use data from the ASEC to address two questions:

  1. Does the lower marriage rate among today’s young adults reflect a generational effect, or is it explained by other factors?
  2. Does the relatively low rate of moving among today’s young adults reflect a generational effect, or is it explained by other factors?

In previous Center analyses, we held only age constant. This time, we want to separate generation not just from age, but also from period, race, gender and education.

Getting started

The data we’ll use in this analysis can be accessed through the Integrated Public Use Microdata Series (IPUMS). After selecting the datasets and variables you need, download the data (as a dat.gz file) and the XML file that describes the data and put them in the same folder. Then, use the package ipumsr to read the data in as follows:

library(tidyverse)
library(ipumsr)
asec_ddi <- read_ipums_ddi(ddi_file = "path/to/data/here.xml")
asec_micro <- read_ipums_micro(asec_ddi)

Next, process and clean the data. First, filter the data so it only includes adults (people ages 18 and older), and then create clean versions of the variables that will be in the model.

Here’s a look at the cleaning and filtering code we used. We coded anyone older than 80 as being 80 because the ASEC already coded them that way for some years. We applied that rule to every year in order to ensure consistency. We only created White, Black and Other categories for race because categories such as Hispanic and Asian weren’t measured until later. Finally, there were several years when the ASEC did not measure whether people moved residences in the past year, so we excluded those years from our analysis.

asec_rec <- asec_micro %>%
  # Filter to adults ages 18 or older
  filter(AGE >= 18) %>%
  
  # EDUC is completely missing for 1963 so exclude that year
  filter(YEAR != 1963) %>%
  
  # Drop an additional 8 cases for which education is coded as unknown
  filter(EDUC != 999) %>%
  
  # Remove a few hundred cases with weights less than or equal to 0
  filter(ASECWT > 0) %>%
  
  # Recode variables for analysis
  mutate(SEX = as_factor(SEX) %>% fct_drop(),
         
         EDUCCAT = case_when(EDUC < 80 ~ "HS or less",
                             EDUC >= 80 & EDUC <= 110 ~ "Some coll",
                             EDUC > 110 & EDUC < 999 ~ "Coll grad",
                             EDUC == 999 ~ NA_character_) %>%
           as_factor(),
         
         # Age was topcoded at 90 from 1988-2001, then at 80 in 2002 
         # and 2003, then at 85 from 2004-present. Since we're trying 
         # to use the entire dataset, age will be topcoded at 80 
         # across the board.
         AGE_TOPCODED = case_when(AGE >= 80 ~ 80,
                                  TRUE ~ as.numeric(AGE)),
         
         # White and Black are present for every single year, whereas 
         # categories like Hispanic and Asian weren't added until 
         # later.
         RACE_COLLAPSED = case_when(RACE == 100 ~ "White",
                                    RACE == 200 ~ "Black",
                                    TRUE ~ "Other") %>%
           as_factor(),
         
         # Calculate birth year from age and survey year
         YEAR_BORN = YEAR - AGE_TOPCODED,
         
         # Calculate generation from birth year
         GENERATION = case_when(YEAR_BORN <= 1945 ~ 
                                  "Silent and older",
                                YEAR_BORN %in% 1946:1964 ~ "Boomer",
                                YEAR_BORN %in% 1965:1980 ~ "Xer",
                                YEAR_BORN %in% 1981:1996 ~ 
                                  "Millennial",
                                YEAR_BORN >= 1997 ~ "Gen Z") %>%
           as_factor(),
         
         # Create binary indicators for outcome variables
         MARRIED = case_when(MARST %in% 1:2 ~ 1,
                             TRUE ~ 0),
         
         # Note that MIGRATE1 is missing for the years 1962, 
         # 1972-1975, 1977-1980, and 1985. The documentation also 
         # indicates that the 1995 data seem unusual.
         MOVED1YR = case_when(MIGRATE1 %in% 3:5 ~ 1,
                              is.na(MIGRATE1) ~ NA_real_,
                              TRUE ~ 0)) %>%
  
  # Group by YEAR and the household ID to calculate the number of 
  # adults in each respondent's household
  group_by(YEAR, SERIAL) %>%
  mutate(ADULTS = n()) %>%
  ungroup() %>%
  
  # Retain only the variables we need for this analysis
  select(YEAR, SERIAL, PERNUM, ASECWT, AGE, AGE_TOPCODED, YEAR_BORN, 
         GENERATION, SEX, EDUCCAT, RACE_COLLAPSED, MARRIED, MOVED1YR, 
         ADULTS) 

The full ASEC dataset has more than 6.7 million observations, with around 60,000 to 100,000 cases from each year. Without the computing power to fit a model to the entire dataset in any reasonable amount of time, one option is to sample a smaller number of cases per year – 1,000 per year in the code below. To ensure that the sampled cases remain representative, sample them proportionally to their survey weights.

set.seed(20220617)
asec_rec_samp <- asec_rec %>%
  group_by(YEAR) %>%
  slice_sample(n = 1000, weight_by = ASECWT)

Model fitting

Now that the data has been processed, it’s ready for model fitting. This example uses the rstanarm package to fit the model using Bayesian inference.

Below, we fit a multilevel logistic regression model with marriage as the outcome variable; with age, number of adults in the household, sex, race and education as individual-level explanatory variables; and with period and generation as normally distributed random effects that shift the intercept depending on which groups an individual is in.

library(rstanarm)
fit_marriage <- stan_glmer(MARRIED ~ AGE_TOPCODED + I(AGE_TOPCODED^2) 
                           + ADULTS + SEX + RACE_COLLAPSED + EDUCCAT +
                             (1 | YEAR) + (1 | GENERATION),
                           data = asec_rec_samp, 
                           family = binomial(link = "logit"),
                           chains = 4, 
                           iter = 1000,
                           cores = 4, 
                           refresh = 1,
                           adapt_delta = 0.97, 
                           QR = TRUE)

Regression models are largely made up of two components: the outcome variable and some explanatory variables. Characteristics that are measured on each individual in the data and that could be related to the outcome variable are potentially good explanatory variables. Every model also contains residual error, which captures anything that influences the outcome variable other than the explanatory variables; this can be thought of as encompassing unique qualities that make all individuals in the data different from one another. A multilevel regression model will also capture unique qualities that make each group in the data different from one another. The model may optionally include group-level explanatory variables as well.

The groups don’t need to be neatly nested within one another, allowing for flexibility in the kinds of situations to which multilevel modeling can apply. In our example, we model age as a continuous, individual-level predictor while modeling generation (cohort) and period as groups that each have different, discrete effects on the outcome variable. Separating age from period and cohort by placing them on different levels allows us to model all of them without running head-on into the identification problem. However, this is premised on a number of important assumptions that may not always hold up in practice.

By modeling the data like this, we are treating the relationship between period or generation and marriage rates as discrete, where each period or generation has its own distinct relationship. If there is a smooth trend over time, the model does not estimate the trend itself, instead looking at each period or generation in isolation. We are, however, modeling age as a continuous trend.

Separating generation from other factors

Our research questions above concern whether the differences by generation that show up in the data can still be attributed to generation after controlling for age, cohort and other explanatory variables.

In a report on Millennials and family life, Pew Research Center looked at ASEC’s data on people who were 23 to 38 years old in four specific years – 2019, 2003, 1987 and 1968. Each of these groups carried a generation label – Millennial, Gen X, Boomer and Silent – and the report noted that younger generations were less likely to be married than older ones.

Now that we have a model, we can reexamine this conclusion, decoupling generation from age and period. The model can return predicted probabilities of being married for any combination of variables passed to it, including combinations that didn’t previously exist in the data – and combinations that are, by definition, impossible. The model can, for example, predict how likely it is that a Millennial who was between ages 23 and 38 in 1968 would be married, even if no such person can exist. That’s useful not for what it represents in and of itself, but for what it can explain about the influence of generation on getting married.

To generate predicted probabilities, we first need to create a dataset for the model to predict on. This dataset should contain every variable used in the model.

Here, we create a helper function that takes four inputs: the fitted model, a data frame of ASEC respondents, a generation category (cohort) and a random seed. We filter the full ASEC data to the years (periods) and age range of interest – regardless of whether those cases were in the subset used to fit the model – and pass that subset to the function, which sets everyone’s generation to the same value while keeping everything else unchanged, whether the resulting input makes sense in real life or not. This simulated ASEC data is then passed to the posterior_predict() function from rstanarm, which returns a 2,000-by-n matrix, where n is the number of observations in the new data and 2,000 is the number of posterior draws retained by the model (4 chains × 500 post-warmup iterations). We then compute the weighted mean of the predictions across the observations within each draw, giving us 2,000 weighted means. These 2,000 weighted means represent draws from a distribution estimating the predicted probability. We summarize this distribution by taking its median, as well as the 2.5th and 97.5th percentiles, to create a 95% interval to express uncertainty.

We then run this function four times, once for each generation, on everyone in the ASEC data who was ages 23 to 38 during each of the four years we studied in the report. First, the function estimates the marriage rate among those people if, hypothetically, they were all Millennials, regardless of what year they appear in the ASEC data. Next, it estimates the marriage rate among those same people if, hypothetically, they were all Gen Xers. Then the function estimates the marriage rate if they were all Boomers or members of the Silent Generation, respectively.

# Helper function to get estimates for hypothetical scenarios
estimate_hypothetical <- function(fit, df, generation, seed) {
  
  # Set generation to the same value for all cases
  df$GENERATION <- factor(generation)
  
  # Get posterior predictions, then the weighted mean across respondents for each draw
  ppreds <- posterior_predict(fit, newdata = df, seed = seed)
  pmeans <- apply(ppreds, MARGIN = 1, weighted.mean, w = df$ASECWT)
  
  # Get median and 95% intervals for the posterior mean
  estimate <- quantile(pmeans, probs = c(0.025, 0.5, 0.975)) %>%
    set_names("lower95", "median", "upper95")
  
  return(estimate)
}


# Filter to the years and age ranges from the original analysis
analysis_subset_marriage <- asec_rec %>%
  filter(YEAR %in% c(2019, 2003, 1987, 1968), 
         AGE %in% 23:38)

# Get estimates for each generation
estimates_marriage <- c("Millennial", "Xer", "Boomer", "Silent and older") %>%
  set_names() %>%
  map_dfr(~estimate_hypothetical(fit = fit_marriage, 
                                df = analysis_subset_marriage, 
                                generation = .x, 
                                seed = 20220907), 
          .id = "GENERATION")

In statistical terms, we are drawing from the “posterior predictive distribution.” This allows us to generate estimates for hypothetical scenarios in which we manipulate generation while holding age, period and all other individual characteristics constant. While this is not anything that could ever happen in the real world, it’s a convenient and interpretable way to visualize how changing one predictor would affect the outcome (marriage rate).

Determining the role of generation

To answer the first of our research questions, let’s plot our predicted probabilities of being married. If we took all these people at each of these points in time and magically imbued them with everything that is unique to being a Millennial, the model estimates that 38% of them would be married. In contrast, if you imbued everyone with the essence of the Silent Generation, that estimate would be 68%. The numbers themselves aren’t important, as they don’t describe an actual population. Instead, we’re mainly interested in how different the numbers are from one another. In this case, the intervals do not overlap at all between the generations and there is a clear downward trend in the marriage rate. This is what we would expect to see if the lower marriage rate among Millennials reflects generational change that is not explained by the life cycle (age) or by other variables in the model (gender, race, education).
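One way to plot these estimates, using the estimates_marriage data frame created above (a sketch, not necessarily the exact chart shown below):

estimates_marriage %>%
  mutate(GENERATION = factor(GENERATION, levels = c("Silent and older", "Boomer",
                                                    "Xer", "Millennial"))) %>%
  ggplot(aes(x = GENERATION, y = median)) +
  geom_pointrange(aes(ymin = lower95, ymax = upper95)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = NULL, y = "Predicted share married")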

A chart showing that Generation is an important factor in predicting the likelihood that young people will be married

What about our second question about generational differences in the likelihood of having moved residences in the past year? The Center previously reported that Millennials were less likely to move than prior generations of young adults. That analysis was based on a similar approach that considered people who were ages 25 to 35 in 2016, 2000, 1990, 1981 and 1963, and identified them as Millennials, Gen Xers, “Late Boomers,” “Early Boomers” and Silents.

Using the ASEC dataset described above, we reexamined this pattern using age-period-cohort analysis. We fit the same kind of multilevel model to this outcome and plotted the predicted probabilities in the same way. In this case, there are no clear differences or trend across the generations, with overlapping intervals for the estimated share who moved in the past year. This suggests that the apparent differences between the generations are better explained by other factors in the model, not generation.

A chart showing that Differences in moving rates are explained by factors other than generation

Conclusion

These twin analyses of marriage rates and moving rates illustrate several key points in generational research.

In some cases (e.g., moving), what looks like a generation effect is actually explained by other factors, such as race or education. In other cases (e.g., marriage), there is evidence of an enduring effect associated with one’s generation.

APC analysis requires an extended time series of data; a theory for why generation may matter; a careful statistical approach; and an understanding of the underlying assumptions being made.

Can machines compete with humans in transcribing audio? A case study using sermons from U.S. religious services
May 11, 2023 | https://www.pewresearch.org/decoded/2023/05/11/can-machines-compete-with-humans-in-transcribing-audio-a-case-study-using-sermons-from-u-s-religious-services/

To test whether machine transcription would be practical for our 2019 and 2020 studies of sermons, we compared human and machine transcriptions of snippets drawn from a stratified random sample of audio and video sermons.


A 2019 Pew Research Center study and a 2020 follow-up study involved the complicated task of transcribing more than 60,000 audio and video files of sermons delivered during religious services at churches around the United States. The primary goal of this research was to evaluate relatively broad topics discussed in the sermons to determine if there were any notable patterns or denominational differences in their length and subject matter.

The huge number of audio and video files meant that it would have been too time-consuming and expensive to ask humans to transcribe all the sermons. Instead, we used Amazon Transcribe, a speech recognition service offered by Amazon Web Services (AWS). We hoped to identify the key themes in the sermons we collected, even if the machine transcriptions were not perfect or at times lacked elements like punctuation that would often come with a traditional human transcription service.

Overall, the machine transcriptions were legible. But we did run into a few challenges. The Amazon service did not always get specific religious terminology or names right. (A few examples included “punches pilot” instead of “Pontius Pilate” and “do Toronto me” in lieu of “Deuteronomy.”) There were also some recordings for which the machine transcription was simply of low quality across the board.

A notable body of research has found that machine transcription sometimes struggles with certain accents or dialects, like regional Southern accents and African American English (AAE). This led us to wonder whether the errors we were seeing in the machine transcripts of sermons were coincidental, or if we were encountering performance biases that could make some transcriptions more reliable than others in a way that might affect the conclusions of our research.

Since we downloaded our sermon files directly as audio or audio/video, we lacked an original written transcript to compare against the machine-transcribed text. Instead, as a test, we asked a third-party human transcription service to tackle portions of some of the sermons that Amazon Transcribe had already transcribed and then compared the results between the two.

What we did

For this experiment, we were interested in using sermons that included a variety of regional accents and dialects among the speakers. One obvious challenge, however, was that we didn’t know much about the speakers themselves. We knew the location of the church where the sermon was delivered, as well as its religious tradition, but these were not necessarily sufficient to assign an accent or a dialect to the person speaking in a recording. We could only use these features as approximations.

With that caveat in mind, we focused the analysis on audio files from the four main religious traditions for which we had a reportable sample size: mainline Protestant, evangelical Protestant, historically Black Protestant and Catholic. We also examined three large geographic regions: the Midwest, the South and a combined region that merges the Northeast and the West (again to account for small sample sizes in those two regions).

We took a stratified random sample of 200 sermons for each combination of religious tradition and region, drawing from churches in proportion to the number of sermons each church had in the dataset. From this sample of full audio files, we took one random snippet of audio with a duration of 30 to 210 seconds from each file and sent those audio snippets to our external human transcription service. This service was a standard online provider that claimed to have native language speakers, a multistep quality check process and experience transcribing religious content, including sermons specifically. At the end of this process, we had a total sample size of 2,387 texts with both machine and human transcriptions.
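A minimal sketch of the stratified sampling step, assuming a sermon-level data frame with hypothetical columns tradition and region; drawing sermons at random within each stratum represents each church in proportion to the number of sermons it contributed:

library(tidyverse)

set.seed(20230511)  # arbitrary seed for reproducibility
sermon_sample <- sermons %>%
  group_by(tradition, region) %>%
  slice_sample(n = 200) %>%
  ungroup()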

How we compared transcriptions

There are a variety of computational methods to measure the similarity or difference between two sets of text. In this analysis, we used a metric known as Levenshtein distance to compare our machine and human transcriptions.

Levenshtein distance counts the number of discrete edits – insertions, deletions and substitutions – at the character level necessary to transform one text string into another. For example, if the word “COVID” is transcribed as “cove in,” there is a Levenshtein distance of three, as the transformation requires three edits: one edit to add a space between the “v” and the “i,” one edit to add an “e” after the “v,” and one edit to substitute an “n” for the “d.”

Levenshtein distance is useful as a comparison metric because it can be normalized and used to compare texts of different lengths. It also allows for nuance by focusing on character-level edits rather than entire words, providing more granularity than something like simple word error rate by scoring how incorrect a mistranscription is.
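In R, the base adist() function computes this character-level Levenshtein distance, which can then be normalized by text length. A minimal sketch of the idea; which string serves as the reference and the normalization denominator used here are illustrative assumptions, not a description of our exact procedure:

text_a <- "covid"    # e.g., one transcription of a snippet
text_b <- "cove in"  # e.g., the other transcription of the same snippet

lev <- adist(text_a, text_b)              # 3 edits for this pair
lev_per_100 <- 100 * lev / nchar(text_a)  # normalize to edits per 100 characters of text_a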

A table showing How Levenshtein distance is calculated

As a final bit of housekeeping, we standardized both our machine and human transcriptions to make sure that they matched one another stylistically. We transformed all the text into lower case, spelled out numbers and symbols when appropriate, and removed punctuation, filler words and words associated with vocalizations (such as “uh” or “ooh”). We also removed the “[UNINTELLIGIBLE]” annotations that the human transcription service included at our request to flag cases in which someone was speaking but their words couldn’t be clearly understood.

Results

Across all the audio files we evaluated, the average difference between machine transcriptions and human transcriptions was around 11 characters per 100. That is, for every 100 characters in a transcription text, approximately 11 differed from one transcription method to the other.

We were also interested in looking at the difference across religious traditions and geographical regions. To do so, we used pairwise t-tests to test for differences in means across all religious traditions and all regions. (We did not calculate comparisons between each religious tradition and region combination after determining the interaction of the two variables was not statistically significant.)
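Those pairwise comparisons can be sketched with base R’s pairwise.t.test(), assuming a snippet-level data frame with hypothetical columns lev_per_100 (the normalized distance), tradition and region; the p-value adjustment method shown is an assumption, since the original analysis does not specify one:

pairwise.t.test(snippets$lev_per_100, snippets$tradition, p.adjust.method = "holm")
pairwise.t.test(snippets$lev_per_100, snippets$region, p.adjust.method = "holm")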

The analysis found a small but statistically significant difference in Levenshtein distances between machine and human transcriptions for several religious traditions. Text taken from Catholic sermons, for example, had more inconsistency between transcripts than was true of those taken from evangelical Protestant sermons. And sermons from historically Black Protestant churches had significantly more inconsistency in transcriptions when compared with the other religious traditions.

While these differences were statistically significant, their magnitude was relatively small. Even for historically Black Protestant sermons – the tradition with the largest mismatch between machines and humans – the differences worked out to around just 15 characters per 100, or four more than the overall average. It’s also important to remember that we cannot assume the speaker is speaking AAE simply because the sermon was given in a historically Black Protestant church.

A chart showing that Transcription consistency varied based on the religious tradition of the church where the sermon was given

One expectation we had going into this experiment is that machine transcription would perform worst with Southern accents. However, we found that transcriptions of sermons from churches in the Midwest had significantly more inconsistency between machine and human transcriptions than those in other regions. Anecdotally, it appears this discrepancy may be because human transcribers had more difficulty than machines in understanding speakers in the Midwest: Sermon texts from the Midwest that were transcribed by humans included a greater number of “[UNINTELLIGIBLE]” annotations than those from other regions. There may also be other factors affecting transcription quality that we cannot account for, such as the possibility that sermons from the Midwest had systematically worse audio quality than those from other regions.

Again, although these differences were statistically significant, their magnitude was relatively small. Midwestern sermons, despite having the greatest inconsistency across regions, had only two more character differences per 100 characters than the overall average.

A chart showing that Transcription was significantly more inconsistent in sermons given in the Midwest than in other regions

Conclusions and suggestions

In social science research, automated transcription services have become a popular alternative to human transcription because of the costs and labor involved in the latter. All in all, we found that the machine transcriptions and the human transcriptions used in this experiment were comparable enough to justify our decision to use an automated service in our research on U.S. sermons.

However, our experience does suggest a few ideas that researchers should keep in mind should they find themselves in a similar situation.

First, issues with transcription quality can be tied to the quality of the audio being transcribed – which presents challenges for humans and computers alike. By the same token, machine transcription may perform worse or better on certain accents or dialects – but that’s also true for human transcribers.  When working with audio that has specialized vocabulary (in our case, religious terms), human transcribers sometimes made errors where machines did not. This is likely because a robust machine transcription service will have a larger dictionary of familiar terms than the average person. Similarly, we found that humans are more likely to make typos, something one will not run into with machine transcription.

More generally, reliability is usually an advantage of machine transcription. Human transcription can vary in quality based on the service used, and possibly from one transcript to another if there are multiple human transcribers. But the reliability of machine transcription can sometimes backfire. When presented with a segment of tricky audio, for example, humans can determine that the text is “unintelligible.” A machine, on the other hand, will try to match the sounds it hears as closely as possible to a word it knows with little to no regard for grammar or intelligibility. While this might produce a phonetically similar transcription, it may deviate far from what the speaker truly said.

Ultimately, both machine and human transcription services can be viable options. Beyond the obvious questions of budget and timeline that are often primary considerations, we would suggest evaluating the nature of the audio files that are being analyzed before transcription begins. Audio of mixed quality, or which features competing sound from an audience, can be tricky for humans and machines alike.

Researchers should also determine how important it is to have formatting and punctuation in the text they hope to analyze. Our researchers found that the lack of these elements can be a key barrier to understanding the meaning of a particular piece of text quickly. In our case, it wasn’t an insurmountable barrier, but it certainly added a significant cognitive burden to tasks like labeling training data. And it might have posed an even bigger problem had our analysis relied more heavily on unguided methods for identifying our topics of interest.

Identifying partisan ‘leaners’ in cross-national surveys
May 5, 2023 | https://www.pewresearch.org/decoded/2023/05/05/identifying-partisan-leaners-in-cross-national-surveys/

In 2022, we experimented with a new question in cross-national surveys to capture the international equivalent of U.S. partisan “leaners.”


In the United States, Pew Research Center measures partisan identification by asking Americans a pair of questions. The first question asks: “In politics today, do you consider yourself a Republican, Democrat, an independent or something else?” The second question is asked of those who identify as an independent or something else in the first question. It prompts: “As of today, do you lean more to the Republican Party or more to the Democratic Party?”

About four-in-ten Americans consider themselves an independent or something else, but the vast majority of this group typically express a preference for one major party or the other. That means that around a third of U.S. adults overall can be thought of as political “leaners” – that is, people who describe themselves as an independent or something else, but still favor one party in particular.

This approach to measuring partisan identification in the U.S. works well because the nation’s political system has only two dominant political parties. But the same task is more challenging in countries with more than two prominent parties.

In cross-national surveys, the Center measures party affiliation by asking respondents instead: “Which political party do you feel closest to?” Interviewers then record the responses using a pre-coded list of options, including an option for respondents to say they don’t identify with any party. Among the countries surveyed by the Center in 2022, the share who say they do not identify with a specific party ranges from 7% in Sweden to 54% in Japan.

Given the Center’s experience with measuring partisan identification in the U.S., we experimented with a new follow-up question in some of the countries we surveyed in 2022. Most of these countries do not have a two-party system, so we structured the new question as follows: After asking people which party they feel closest to, we asked those who did not name a party whether there is one party they feel closer to than others. Those who said yes were then asked which one, allowing us to capture the international equivalent of U.S. “leaners.”
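The resulting three-way classification can be derived roughly as below, assuming hypothetical variables party_closest (the first question; NA if no party is named) and party_lean (the follow-up; NA if the respondent does not lean toward a party):

library(tidyverse)

respondents <- respondents %>%
  mutate(party_group = case_when(!is.na(party_closest) ~ "Partisan",
                                 !is.na(party_lean) ~ "Leaner",
                                 TRUE ~ "No party preference"))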

This post will walk through the results of that experiment and its implications for our future work.

Do people in other countries lean toward certain political parties?

A table showing that Large shares in many countries do not identify with a political party, but up to about a fifth lean toward one

Across the 17 non-U.S. countries included in our experiment, there was wide variation in the share of adults who indicated that they lean toward a specific political party. In Hungary, Poland and Israel, for example, only 3% said they lean toward a particular party. But in Japan – the sole country where more than half of people offered no partisan affiliation when asked the first of our questions – 17% said they lean toward a party.

How are leaners different from partisans?

A table showing that ‘Leaners’ are less likely than partisans to see their political system as effective

A key question, then, is how leaners (those who don’t closely affiliate with a party but do have a preference for one) differ from partisans (those who actively identify with a party when first prompted). It is worth noting that the sample size for leaners is relatively small in many countries.

Generally speaking, leaners are less likely than partisans to see their country’s political system as effective. For example, in the United Kingdom, leaners are about half as likely as partisans (23% vs. 41%) to say their country’s political system allows people like them to have a great deal or a fair amount of influence on politics.

A table showing that ‘Leaners’ are more likely than partisans to place themselves in the ideological center

Leaners also tend to hold certain democratic processes in lower regard than partisans in some countries. In Spain, 62% of leaners say voting is very important for being a good member of society, while 81% of Spanish partisans say the same.

The two groups also differ in other respects, including ideology. In many countries, leaners are much more likely than partisans to describe themselves as politically moderate. In the U.S., 52% of leaners place themselves at the center of the ideological spectrum (“moderates”), compared with 31% of partisans. Partisans, in turn, are more likely than leaners to place themselves on the ideological left (“liberals”) or right (“conservatives”).

A bar chart showing that ‘Leaners’ tend to be younger than partisans

In a few countries, leaners also skew somewhat younger than partisans. For example, 29% of leaners in Singapore are under age 30, compared with only 15% of partisans. And, in many countries, partisans are more likely than leaners to be ages 50 and older. For instance, 47% of Spanish partisans are in this age group, compared with 28% of Spanish leaners.

In some countries, leaners are somewhat more likely than partisans to be women – as in the case of the Netherlands, where 66% of leaners are women, compared with 49% of partisans. However, there are some exceptions to this pattern, as in the U.S., where leaners are somewhat more likely than partisans to be men (51% vs. 46%).

Notably, leaners and partisans do not tend to differ significantly when it comes to educational attainment. 

Do people who lean toward a party have similar attitudes as partisans of that party?

The next question to evaluate is whether leaners and partisans of the same political party hold similar attitudes, including in their views of political parties, democracy, the state of their nation’s economy and more.

In the U.S., independents who lean toward a party generally have similar views as those who directly affiliate with the same party, though leaners are less emphatic in their support for that party. For example, 54% of Democratic leaners have a favorable view of the Democratic Party, compared with 84% of those who identify as Democrats. Similarly, 59% of Republican leaners have a favorable view of the Republican Party, compared with 87% of those who identify as Republican.

Some of these patterns are evident among leaners and partisans in the other countries surveyed. In most places, leaners tend to have less positive views of the party they back than partisans do. In Canada, 45% of Liberal Party leaners have a positive view of the Liberal Party, compared with 82% of Liberal partisans. The same is true when it comes to views of Canada’s New Democratic Party (NDP) among both NDP leaners (76% favorable) and NDP partisans (92%), as well as views of the Conservative Party among Conservative leaners (68% favorable) and Conservative partisans (83%).

A chart showing that ‘Leaners’ in Canada tend to have less positive views of the party they back – and more positive views of other parties – than their partisan counterparts do

As in the U.S., leaners internationally often have less negative views of other parties than partisans do. Most often, this is the case when it comes to evaluations of the party in power. For example, in Canada, Conservative leaners are about twice as likely as Conservative partisans (31% vs. 16%) to have a favorable view of the ruling Liberal Party. 

A table showing that governing party ‘leaners’ tend to be less satisfied with democracy than governing party partisans

Leaners and partisans also differ in their satisfaction with democracy and their views of whether their nation’s economy is in good shape. This is particularly the case among leaners and partisans who favor the governing party. In the UK, for example, Conservative Party leaners are less likely than Conservative partisans to say they are satisfied with democracy in their country (50% vs. 74%) and that the British economy is in good shape (26% vs. 45%). In a few countries, given the political nature of opinions about the state of democracy and the economy, opposition party leaners can be more positive on these measures than their partisan counterparts. Take the UK again: 41% of Labour Party leaners say the economy is in good shape, while only 25% of Labour partisans say the same.

How would including leaners in our analyses change our findings?

One of the main ways we use data on political party affiliation is to look at how attitudes on issues vary among supporters of different parties. For example, we might want to know whether people who support different political parties differ in how much they trust their government to do what is right for the country. Because of the new cross-national leaner question we included in 2022, we can now consider analyzing leaners alongside partisans. The larger sample sizes would potentially give us more statistical power to detect cross-party differences. But including leaners might also change some of our substantive findings, given the notable demographic and attitudinal differences between leaners and partisans noted above. So, how would including leaners affect our findings?

Results indicate that views of key issues – including evaluations of the economy, democracy, the government’s handling of COVID-19 and the importance of voting – do not change significantly when leaners are added to partisans of the same party. For example, 74% of Social Democratic Party (SPD) partisans and leaners combined in Germany are satisfied with democracy in their country, compared with 75% of SPD partisans alone. Among those who back Alliance 90/The Greens in Germany, 86% of partisans and leaners combined are satisfied with democracy, as are 86% of partisans alone.

Likewise, in Canada, 90% of Liberal partisans and leaners combined believe voting is very important for being a good member of society, identical to the share among Liberal partisans alone. And 85% of Conservative partisans and leaners combined in Canada believe voting is very important, compared with 86% of Conservative partisans on their own.

Conclusion

Ultimately, this investigation shows there is some value to including a question about party leaning in our international surveys. It yields larger samples for analysis in some countries, allowing for greater power in statistical tests. And leaners resemble partisans in some important ways. Plus, in the U.S., Pew Research Center’s standard in most analyses is to show results for “leaned party” (partisans and leaners together), so including this question may allow us to harmonize that approach cross-nationally.

But adding the question is not without cost on a crowded survey. And in some countries, the additional sample size generated by the follow-up question does not meaningfully affect our ability to report on specific political parties. All told, we’ll need to weigh these competing factors when deciding whether to include the leaner question on future cross-national surveys.

The post Identifying partisan ‘leaners’ in cross-national surveys appeared first on Decoded.

How we review code at Pew Research Center https://www.pewresearch.org/decoded/2023/04/05/how-we-review-code-at-pew-research-center/ https://www.pewresearch.org/decoded/2023/04/05/how-we-review-code-at-pew-research-center/#respond Wed, 05 Apr 2023 15:59:13 +0000 https://www.pewresearch.org/decoded/?p=873 As part of our quality control process, we review code with a series of interim reviews during a project and a formalized code check at the end.

The post How we review code at Pew Research Center appeared first on Decoded.

Pew Research Center illustration

At Pew Research Center, we work hard to ensure the accuracy of our data. A few years ago, one of our survey researchers described in detail the process we use to check the claims in our publications. We internally refer to this process as a “number check” – although it involves far more than simply checking the numbers.

As part of our quality control process, we also verify that the code that generates results is itself correct. This component of the process is especially important for our computational social science work on the Data Labs team because code is everywhere in our research, from data collection to data analysis. In fact, it is so central that on Labs we have spun off the review of our code into two distinct steps that precede the number check: a series of interim reviews that take place throughout the lifetime of a project, and a more formalized code check that happens at the end. This is our adaptation of what software developers call “code review.”

This post describes that process – how it works, what we look for and how it fits into our overall workflow for producing research based on computational social science.

Interim reviews

We have learned over the years that a good quality control process for code should move alongside the research process. This helps avoid building our results on top of early mistakes that accumulate and are difficult to fix right before publication. As a result, all projects receive at least one interim review well in advance of having a finished report in hand.

What we look for

The main goal of our interim reviews is to serve as midway checks that ensure the code is readable and properly documented – that is, to make sure it’s easy for other researchers from our team and elsewhere to understand. In that sense, the code reviewer stands in for a potential future reader and points out places where the code could be easier to follow or needs extra documentation.

These are the kinds of things we check for:

  • Is the code organized using our internal project template? We use a custom cookiecutter template to make sure that all our repositories have clear, predictable folder names and that they always include standard integrations.
  • Are all the scripts included in the GitHub repository? Does the code load all the packages it needs to run? Are the datasets it uses stored in locations other researchers on the team can access?
  • Do all scripts include a top-level description of what they accomplish? Are all scripts organized sensibly? Does the description of every function match what the function does? Are all inputs and outputs for functions and scripts documented correctly?
  • Are all object names sufficiently informative? Do they follow our internal style guide? For instance, we follow the tidyverse style guide for R, which defines a common standard to determine good and bad names for things like functions and variables (such as a preference for snake_case over camelCase). A brief sketch of the kind of naming and documentation we look for appears after this list.
  • Is the code idiomatic and easy to follow? Does it use custom functions that could be replaced with more standard ones? Does it use trusted external libraries? If the analysis consumes a lot of computational resources, are there changes that could make the code more efficient?
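
To make the checklist above more concrete, here is a hypothetical example of the kind of script header, function documentation and naming an interim review looks for. The file, function and variable names are made up for illustration rather than taken from an actual Center project, and the roxygen-style comments are just one common documentation convention.

```r
# summarize_confidence.R
# Purpose: collapse survey responses into weighted shares for each
# response category. (Hypothetical example; not from a Center project.)

library(dplyr)

#' Compute the weighted share of respondents giving each response
#'
#' @param svy_data Data frame with one row per respondent.
#' @param response Unquoted column holding the response category.
#' @param weight   Unquoted column holding the survey weight.
#' @return A data frame with one row per response category and its weighted share.
weighted_share <- function(svy_data, response, weight) {
  svy_data %>%
    count({{ response }}, wt = {{ weight }}, name = "weighted_n") %>%
    mutate(share = weighted_n / sum(weighted_n))  # snake_case names, per the style guide
}
```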

How interim reviews benefit our work

There are two main advantages to having these small, periodic checkpoints during the research process. First, they help reduce the burden of the final review by making sure that the code is intelligible along the way. The interim reviews create opportunities for researchers to clean up and document code in manageable chunks.

Second, and more importantly, these interim reviews can catch errors early in the research process, long before we get to a final review. By focusing on whether someone else can follow what is happening throughout the code, these reviews are also excellent opportunities for the authors to confirm that their technical and methodological choices make sense to another researcher.

Things to watch for

In our experience, this level of review calls for a balance that is sometimes tricky to find. On the one hand, the reviewer is in a good position to assess whether the code is legible and complies with our internal technical standards. On the other hand, we all have our own personal styles and preferences.

Some of us prefer to use functions from base R while others favor the tidyverse. And while some think that structuring the code in classes and methods is appropriate, others may find it unnecessarily complicated. It can be hard to set aside our own personal tastes when we are commenting on other people’s work. For that reason, we try to distinguish between required changes (for instance, if a function is improperly documented or if code deviates from our internal style guide) and suggested changes (which the researcher may choose to follow or ignore).

The code check

The code check is a thorough and more systematic review of all the project code that takes place before the number check. It is the level of review that verifies the accuracy of the code that produces the published results. It’s also our last opportunity to catch errors, and, because of that, it is a very intense process.

What we look for

The code reviewer evaluates the project code line by line, assessing whether it does what the researchers say it does. Every code reviewer performs this review in their own way, but it’s useful to think of this process as evaluating three different things:

  1. Do the researchers’ analytical choices make sense? (Is the method they want to use a sensible way of analyzing the data given their research question?)
  2. Did the researchers correctly implement their analytical choices? (Are they using a function that performs the method as it’s supposed to?)
  3. Does the data move through each step in the intended way? (Are the inputs and outputs of each step correct?)

Pew Research Center uses a wide range of technical and analytical approaches in our projects, so it’s almost impossible to come up with an exhaustive list of things that a reviewer must check for. The important thing is to make as few assumptions as possible about what is happening in the code – taking sufficient distance to be able to spot mistakes. For instance, in an analysis that combines survey results with digital trace data from, say, Twitter, these are things a reviewer could check:

  • Is the code reading data from the correct time period in our database? Is it using the correct fields? Did the researcher merge the two sources using the correct variable and the correct type of join?
  • Is the code dropping cases unexpectedly? Does it generate missing values and, if so, how do they affect the statistical analysis? If we are collapsing categorical variables, are the labels still correct? Because we often do one-off analyses, it’s not always worth writing formal unit tests for the data, so the reviewer has to manually check that the input and output of each function transforming the data are sensible. (A brief sketch of such checks follows this list.)
  • Does the data we’re using align with our research goal? Do the researchers make unwarranted assumptions in the methods they’re using? Do the calculations use the correct base and survey weight? For instance, we frequently study how people use social media sites such as Twitter by linking respondents’ survey answers to their public activity on the platform. However, not every respondent has a Twitter account and not every respondent with a Twitter account will consent to the collection of their Twitter activity. When we report these results, the reviewer has to carefully check the denominators in any proportions: Is this a result about all respondents, respondents who use Twitter, or only respondents who consented for us to collect their behaviors on the site?
  • Does the code reproduce the same results that the researchers report when it is executed on a clean environment by the reviewer?
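
As an illustration of that kind of manual checking, here is a short, hypothetical sketch of the sanity checks a reviewer might run after a survey file is merged with Twitter data. The object and variable names (svy, tweets, qkey, consented and so on) are stand-ins, not our actual data structures.

```r
# Hypothetical sanity checks after merging survey data with Twitter data.
library(dplyr)

merged <- left_join(svy, tweets, by = "qkey")

stopifnot(nrow(merged) == nrow(svy))       # the join shouldn't duplicate respondents
sum(is.na(merged$n_tweets))                # how many respondents have no matched tweets?
table(merged$consented, useNA = "ifany")   # does the consent breakdown look right?

# Check the denominator before reporting a proportion: this is a share of
# consenting Twitter users, not of all respondents.
twitter_users <- filter(merged, consented == 1)
mean(twitter_users$retweeted_news, na.rm = TRUE)
```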

How code checks benefit our work

The code check is always done by a team member who was not involved in the research project in question. A fresh pair of eyes always provides a valuable perspective – for example, about what the researchers want to do and whether their code serves those purposes. In that sense, the code check is our final opportunity to look at a research project as a whole and evaluate whether the decisions we made along the way make sense in retrospect and were implemented correctly.

Having a new researcher look through the whole code has other advantages. The reviewer is expected to run the code top to bottom to confirm that it produces all the figures that we will report. It sounds trivial, but that’s not always the case. It means verifying that all the code is included in the repository, that all dependencies and their installation are documented properly, that raw data is stored in the correct place and, finally, that, after running in a new environment, the code produces the exact same results that will appear in the final publication. In other words, the final code check tests the reproducibility of our research.
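
The specifics of how dependencies get documented vary from project to project. As one common illustration in R – not necessarily our internal setup – the renv package can pin package versions so that a reviewer can recreate the original environment before rerunning the code:

```r
# Illustrative only: pinning package versions with the renv package so the
# reviewer can rerun the analysis in a clean environment.
install.packages("renv")   # once per machine

renv::init()       # create a project-local library and an renv.lock file
renv::snapshot()   # record the exact package versions the analysis used

# The reviewer, after cloning the repository, runs:
renv::restore()    # reinstall the recorded versions before rerunning the scripts
```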

Things to watch for

The purpose of the code check is to evaluate whether the code works as intended. Even more than during the interim reviews, reviewers need to stay focused on that question and forgo comments about things they might have done differently. By the time the code reaches the code check stage, reviewers should not be thinking about performance or style. Those discussions can happen during the interim reviews instead.

That said, code reviewers are certainly welcome to offer recommendations or suggestions at the code check stage. Learning opportunities abound during this process. But they are asked to be very explicit at this point to distinguish “must-haves” from “nice-to-haves.”

How we use GitHub

We take full advantage of git and GitHub for our code review process. Both the interim reviews and the code check are structured as pull requests that move code from personal branches to shared ones. By doing that, we know that the code in our shared branches (dev during interim reviews, and main after the code check) has been looked at by at least two people, typically including someone who is not a researcher on the project in question. These pull requests are also good starting points for making additional contributions to the project. You can learn more about how we use branches in our Decoded post on version control.

A webpage displaying a GitHub conversation kicking off the second part of a code check with two reviewers and one assignee.

But the major advantage of using GitHub in this process is its interface for pull requests, which allows for discussion. The reviewer can leave comments to alert researchers about potential issues, request additional documentation or add suggestions – always pointing to specific lines of code and with a clear decision at the end (that is, whether the code is OK as it is or whether the authors need to make some edits). If the researchers make changes, the reviewer can see them in the context of a specific request. If they choose not to make a change, they can instead start a discussion thread, which gives everyone on the project additional documentation about some of the less obvious technical and methodological choices. This back-and-forth between the reviewer and the researchers continues for as many iterations as needed, until the reviewer’s concerns have been addressed and the code moves on to the corresponding shared branch.

Conclusion

Collectively, these quality control steps help us identify errors. Sometimes these errors are small, like using “less than” instead of “less than or equal to.” Sometimes they’re big, like using the wrong weights when analyzing a survey.

But these steps also help us produce higher-quality work in general. Writing code knowing that other people will need to read, understand and run the code results in clearer, better-documented code. Every moment of review is an opportunity to confirm the quality of the research design. The process is time-consuming, but it’s worthwhile – not just because it ensures that the code is correct, but also because it provides built-in moments to stop, evaluate and discuss the project with others. This results in research that is more readable, reproducible and methodologically sound.

The post How we review code at Pew Research Center appeared first on Decoded.

Nonresponse rates on open-ended survey questions vary by demographic group, other factors https://www.pewresearch.org/decoded/2023/03/07/nonresponse-rates-on-open-ended-survey-questions-vary-by-demographic-group-other-factors/ https://www.pewresearch.org/decoded/2023/03/07/nonresponse-rates-on-open-ended-survey-questions-vary-by-demographic-group-other-factors/#respond Tue, 07 Mar 2023 15:59:58 +0000 https://www.pewresearch.org/decoded/?p=819 Demographic characteristics and other factors, such as the devices that respondents use to take surveys, are tied to Americans’ willingness to engage with open-ended questions.

The post Nonresponse rates on open-ended survey questions vary by demographic group, other factors appeared first on Decoded.

Pew Research Center illustration

Among the tools that survey researchers use to gain a deeper understanding of public opinion are open-ended questions, which can powerfully illustrate the nuance of people’s views by allowing survey takers to respond using their own words, rather than choosing from a list of options. Yet we know from existing research that open-ended survey questions can be prone to high rates of nonresponse and wide variation in the quality of responses that are given.

We wanted to learn more about Americans’ willingness to respond to open-ended questions, as well as differences in the substantive nature of the responses they provide. In this analysis, we examine the extent to which certain characteristics – including demographic factors and the types of devices that Americans use to take our surveys – are associated with respondents’ willingness to engage with open-ended questions, as well as the length of the responses they offer. This analysis is based on an examination of open-ended responses from Pew Research Center’s American Trends Panel (ATP), a panel of more than 10,000 U.S. adults who are selected at random and take our surveys in a self-administered, online format.

What we did

We examined nonresponse patterns by observing the median rate of nonresponse to a set of open-ended questions asked on several ATP surveys. (We defined “nonresponse” as skipping a question entirely. As a result, even insubstantial entries such as “??” counted as responses rather than as nonresponse in this analysis. Had we counted such entries as nonresponse, the rates documented here would be higher.) We also analyzed the length of responses among those who did provide answers to these questions.

But first, we had to prepare our sample of data for this analysis. We began by taking a census of the open-ended questions asked on the 58 ATP surveys conducted between November 2018 and September 2021. To be considered in this analysis, a question had to be: 1) asked as part of a general population survey, meaning the survey was sent to a sample intended to be representative of the adult population of the United States; and 2) given to the entire sample, with every respondent given the opportunity to answer the question. Using these criteria, we identified 30 open-ended questions from 26 surveys, covering a variety of topics, that could help us examine our core questions.

Measuring nonresponse rates

We identified respondents who skipped open-ended question prompts by using a vendor-created variable that identified when a respondent left an empty response box for an open-ended question.

(N who did not respond to prompt / N asked question prompt) x 100 = Nonresponse rate

Using this variable, we divided the number of respondents who did not answer the prompt by the total number of respondents who were asked it, and we multiplied the result by 100. We called this percentage our “nonresponse rate” for each question.

We also calculated nonresponse rates, using the same formula, across key demographic groups. We examined differences by gender, age, race and ethnicity, educational attainment, partisan identification, political ideology, voter registration, self-identified urbanicity and the type of device the respondent used to take the survey.
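
For readers who want to see what this calculation looks like in practice, here is a minimal sketch using dplyr. The data frame and variable names (responses, skipped_oe, educ_cat) are hypothetical stand-ins rather than actual ATP variables.

```r
# Hypothetical sketch: nonresponse rates by education, using dplyr.
library(dplyr)

nonresponse_by_group <- responses %>%    # one row per respondent asked the prompt
  group_by(educ_cat) %>%                 # or gender, age group, device type, ...
  summarize(
    n_asked   = n(),
    n_skipped = sum(skipped_oe),         # skipped_oe: 1 if the response box was left empty
    nonresponse_rate = 100 * n_skipped / n_asked
  )
```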

A table showing examples of question prompt lengths, from the one-word category to the multiple-sentences category

We also examined whether the rates were consistent across questions that prompted shorter or longer responses. We categorized the open-ended questions into three groups based on the prompted length of the response: 1) one-word responses; 2) short sentences; or 3) detailed descriptions in multiple sentences.

Measuring response length

We identified 19 questions that asked for responses of short or multiple sentences. For these questions, we recorded the average character count of the responses provided for each question. Using these averages, we evaluated how the length of responses differed between groups of respondents. (Note: For this analysis, we lightly cleaned responses to eliminate characters such as extra spaces or repeated punctuation marks. Different decisions about how to clean responses could lead to slightly different results than those we report here.)
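
Here is a small, hypothetical sketch of what that cleaning and length measurement could look like in base R. The exact cleaning rules we applied may differ, and the column names (open_text, device_type) are stand-ins.

```r
# Hypothetical sketch: light cleaning and character counts in base R.
clean_text <- trimws(gsub("\\s+", " ", responses$open_text))             # collapse extra spaces
clean_text <- gsub("([[:punct:]])\\1+", "\\1", clean_text, perl = TRUE)  # collapse repeated punctuation

responses$resp_length <- nchar(clean_text)

# Average length among those who answered, by device type
answered <- responses[responses$resp_length > 0, ]
tapply(answered$resp_length, answered$device_type, mean)
```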

What we found

Nonresponse rates vary across questions

Across all 30 questions examined in this study – including those asking for just a single word and those asking for detailed descriptions – nonresponse rates ranged from 4% to 25%, with a median of 13%.

Women, younger adults, Hispanic and Black adults, and people with less formal education were less likely to respond to open-ended question prompts than men, older adults, White adults and those with more years of education, respectively.

A chart showing that rates of nonresponse differ by demographic group

There was no difference in the median nonresponse rate by political party, with nonresponse rates of 13% for Republicans and Republican-leaning independents and for Democrats and Democratic leaners. But those who reported not leaning toward either party had a much higher median rate of nonresponse (31%).

Nonresponse rates were also higher among those who used a tablet or cellphone to take a survey (15% for each group) than among those who used a desktop or laptop computer (11%).

These aggregate patterns of nonresponse were observed across most of the individual questions examined in this study. (While this analysis was based on unweighted data, analysis that uses data weighted to population benchmarks for each survey sample produces substantively similar results.)

Prompts that ask for more detailed responses have higher nonresponse rates

Questions that asked for a quick, one-word response and those that asked for a short sentence both had median nonresponse rates of 12%, while those that prompted for multiple sentences had a higher median nonresponse rate (17%).

However, demographic differences in nonresponse remained fairly consistent regardless of question type. For example, for all three types of questions – those asking for one word, those asking for a short sentence and those asking for multiple sentences – Americans with a high school diploma or less education were much less likely to respond than Americans with a postgraduate degree.

Marginal differences in response length across most demographic groups

Among those who did respond to open-ended prompts, there were only small differences in response length across most demographic groups. However, adults with higher levels of formal education and those who completed a survey on a computer tended to offer longer responses than other respondents to questions that asked for at least a short sentence.

A bar chart showing that more educated adults and those using computers tend to provide longer answers to open-ended questions

Analyzing responses across the 19 questions that asked for either short sentences or for detailed explanations, we found that adults with a postgraduate degree or a bachelor’s degree offered slightly longer responses (averaging 107 characters and 102 characters, respectively) than those with only some college experience (89) or a high school diploma or less education (74).

There were also notable differences in response length by device type. Generally, responses provided via desktop or laptop computers (113 characters, on average) were longer than responses provided on a tablet (87) or mobile device (80).

This dynamic is evident in responses to the following open-ended question, which was asked in a June 2019 survey: “In your own words, what does ‘privacy’ mean to you?” The average response length for this question was 89 characters for those using computers, 70 for those using tablets and 59 for those using mobile devices.

Substantively, the responses given across these groups are not too different, but longer responses often provide more clarity and may mention more concepts.

Conclusion

Open-ended questions are an important tool in survey research. They allow respondents to express their opinions in their own words and can provide nuance to inform interpretation of closed-ended questions. However, researchers should examine differential nonresponse patterns to these questions, as well as differences in the substantive answers that are provided.

This analysis shows that people with lower levels of formal education, younger people, Black and Hispanic people, women and those less engaged with politics are less likely to provide responses to open-ended questions – and that some of these groups offer shorter responses on average when they do answer prompts. While the magnitude of these differences is often modest, researchers should consider that their findings may disproportionately reflect the views of those who are more likely to respond and to provide detailed responses.

The post Nonresponse rates on open-ended survey questions vary by demographic group, other factors appeared first on Decoded.

Testing survey questions about a hypothetical military conflict between China and Taiwan https://www.pewresearch.org/decoded/2023/03/02/testing-survey-questions-about-a-hypothetical-military-conflict-between-china-and-taiwan/ https://www.pewresearch.org/decoded/2023/03/02/testing-survey-questions-about-a-hypothetical-military-conflict-between-china-and-taiwan/#respond Thu, 02 Mar 2023 17:00:44 +0000 https://www.pewresearch.org/decoded/?p=786 Given the complexities of geopolitics, how might wording affect responses to a question about a hypothetical conflict between China and Taiwan?

The post Testing survey questions about a hypothetical military conflict between China and Taiwan appeared first on Decoded.

Pew Research Center illustration

Crafting survey questions about geopolitics can be difficult, particularly if one is interested in learning how people would respond to a hypothetical, as-yet unrealized situation. Pew Research Center has done this in the past by asking people in NATO member states about Article 5 of the NATO treaty: specifically, whether their country should use force to defend a NATO ally if that ally got into a serious military conflict with Russia. But as the current war between Ukraine and Russia highlights, the complexities of how a conflict breaks out – e.g., who is seen as the aggressor, whether an incursion is seen as provoked, whether “defending” an ally should involve boots on the ground – can affect public attitudes.

Given these concerns, how might wording affect responses to a question about a hypothetical conflict between China and Taiwan? We decided to find out by testing multiple formulations in an experiment using a non-representative, opt-in online survey panel with American adults.

We wanted to understand two things through this experiment. First, if we simply described a hypothetical clash between China and Taiwan as a “military conflict” – as we did in the NATO question above – would people answer it similarly to, or differently from, a question that names a specific aggressor, such as China invading Taiwan? Second, given the carefully calibrated “status quo” balance across the Taiwan Strait, we wanted to understand how Americans might respond to a hypothetical situation in which China invades Taiwan, versus one in which China invades after Taiwan declares independence.

To investigate these ideas, we randomly assigned the following three questions to respondents in the United States:

  • If there were a military conflict between China and Taiwan, do you think our country should support China, support Taiwan or remain neutral? (We’ll refer to this question as “Conflict.”)
  • If China invaded Taiwan, resulting in a military conflict, do you think our country should support China, support Taiwan or remain neutral? (We’ll refer to this question as “Invade.”)
  • If China invaded Taiwan after Taiwan declared independence, do you think our country should support China, support Taiwan or remain neutral? (We’ll refer to this question as “Declare.”)

For each of these questions, we randomized the top two response options, meaning that some respondents saw the “support China” option first and others saw the “support Taiwan” option first. We consistently placed the “remain neutral” option at the bottom of the list of responses.
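
For illustration, random assignment like this takes only a few lines of R. The sketch below is a simplified stand-in for the actual survey programming, and the object names (respondents, form, taiwan_first) are hypothetical.

```r
# Simplified stand-in for the survey programming: random assignment of
# question wording and response-option order.
set.seed(2023)
n <- nrow(respondents)   # hypothetical data frame of panelists

# Assign each respondent to one of the three question wordings
respondents$form <- sample(c("Conflict", "Invade", "Declare"), n, replace = TRUE)

# Independently randomize which of the top two options appears first;
# "remain neutral" always stays at the bottom of the list
respondents$taiwan_first <- sample(c(TRUE, FALSE), n, replace = TRUE)
```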

A table showing that U.S. respondents’ views about a hypothetical conflict between China and Taiwan differ by question wording

Results of the experiment indicate that participants do differentiate between these scenarios. When responding to the first question about a more generic “military conflict,” respondents are divided over remaining neutral (47%) and supporting Taiwan (45%). Very few choose the option of supporting China (8%).

When China is described as invading Taiwan, however, respondents are slightly more likely to want to support Taiwan (49%) than to remain neutral (44%). Again, very few choose the option of supporting China (7%).

The picture changes again when people are asked about a Chinese invasion that follows a declaration of independence by Taiwan. In this hypothetical, there is a slightly greater preference for neutrality (50%) over siding with Taiwan (40%). Only one-in-ten choose to support China.

A bar chart showing that among Republican and Democratic participants alike, hypothetical support for Taiwan is highest when asked about an unprovoked invasion by China

It’s even clearer that preferences differ across these conditions when one looks at who prefers which option. Take partisanship as an example: Across all three questions, Democratic and Democratic-leaning independent respondents are somewhat more likely than Republican and Republican-leaning respondents to support Taiwan. But the gap between the parties is widest when the scenario involves Taiwan declaring independence. Under this hypothetical, 46% of Democratic respondents say the U.S. should support Taiwan, compared with 28% of Republicans.

There are also notable differences by age. Across all three scenarios, respondents under 30 have a slight preference for remaining neutral. But when they are presented with the hypothetical that China has invaded Taiwan without Taiwan having first declared independence, far more of these younger respondents support Taiwan (44%) than under the other two hypotheticals (29% in the “Conflict” scenario and 26% in the “Declare” scenario).

Respondents ages 60 and older are also particularly likely to support Taiwan in the scenario where it has been invaded without having first declared independence: Nearly nine-in-ten (86%) take this position. Around two-thirds of these older participants (68%) also support Taiwan when they are asked about a more generic “military conflict” between Taiwan and China. Notably, however, only around half of these older adults (48%) would support Taiwan if Taiwan has first declared independence.

A bar chart showing that older American participants are more likely to support Taiwan in hypothetical conflict scenarios involving China

Taken together, the results of this analysis suggest that a single question may be insufficient to capture how people feel about “supporting Taiwan,” given that support is somewhat contingent on what type of provocation, if any, there is in a potential conflict. We will need to take these complexities into account if we try to measure attitudes on this issue moving forward.

The post Testing survey questions about a hypothetical military conflict between China and Taiwan appeared first on Decoded.

How we keep our online surveys from running too long https://www.pewresearch.org/decoded/2022/12/08/how-we-keep-our-online-surveys-from-running-too-long/ https://www.pewresearch.org/decoded/2022/12/08/how-we-keep-our-online-surveys-from-running-too-long/#respond Thu, 08 Dec 2022 17:59:55 +0000 https://www.pewresearch.org/decoded/?p=667 While there is no magic length that an online survey should be, Pew Research Center caps the length of its online American Trends Panel surveys at 15 minutes.

The post How we keep our online surveys from running too long appeared first on Decoded.

Pew Research Center illustration

The longer a survey is, the fewer people are willing to complete it. This is especially true for surveys conducted online, where attention spans can be short. While there is no magic length that an online survey should be, Pew Research Center caps the length of its online American Trends Panel (ATP) surveys at 15 minutes, based on prior research.

Questions that take longer to answer cost more ‘points’

But how can we know how long a survey will take to complete before it’s been taken? This is where question counting rules come in. In this post, we’ll explain how Pew Research Center developed and now uses these rules to keep our online surveys from running too long.

In a nutshell, we classify survey questions based on their format. Each question format has a “point” value reflecting how long it usually takes people to answer. Formats that people tend to answer quickly (e.g., a single item in a larger battery) have a lower point value than questions that require more time (e.g., an open-ended question where respondents are asked to write in their own answers). A 15-minute ATP survey is budgeted at 85 points, so before the survey begins, researchers sum up all of the question point values to make sure the total is 85 or less.

The Center’s point system was developed using historical ATP response times. When administering an online survey, researchers can see how long it takes each respondent to answer the questions on each screen. This information has allowed us to determine, for example, that more challenging open-ended questions can take people minutes to answer, while an ordinary stand-alone question takes only about 10 seconds or so.

Question types

Stand-alone questions

Stand-alone questions count as 1 point

Stand-alone questions are the most common type on the ATP. They are likely what comes to mind when someone imagines a classic survey experience – a straightforward question followed by a set of answer choices. In our counting scheme, these questions count as one point.

Battery items

Each battery item counts as 0.67 points

Battery items are another popular form of survey question. Batteries refer to a series of questions with the same stem (e.g., “What is your overall opinion of…”), followed by a variety of answer options. Each answer option is assigned two-thirds of a point (0.67 points) in our question counting rules. For example, a battery of five items would count as 5 x 0.67, or 3.35 points in total.

Open-ended questions

Open-ended questions are among the most difficult and time-consuming for respondents to answer. They are also the most likely type for respondents to skip without answering. Open-ended questions require respondents to form answers themselves rather than selecting from the options that are provided. While a standard stand-alone question can be answered in a matter of seconds, open-ended questions often take a minute or more to answer. As a result, these questions have markedly higher points assigned to them in the question counting guidelines.

Some open-ended questions take more than a minute, on average, to answer

But not all open-ended questions are created equal. Over the years, we’ve noticed that some take much longer to answer than others. For example, questions like, “In a sentence or two, please describe why you think Americans’ level of confidence in the federal government is a very big problem” would take respondents much longer to answer than a question like, “Who would be your choice for the Republican nomination for president?” Questions that ask for more detail are assigned eight points, while questions that are more straightforward are assigned five points.

Check-all questions

Check-all questions take about twice as long as a standard (select one) question

Check-all questions ask the respondent to select all response options that apply to them. These questions can yield less accurate data than stand-alone questions, so they are rarely used on the ATP. Check-all questions count for two points.

Vignettes

Vignettes present respondents with a hypothetical scenario and then ask questions about that scenario. Typically, at least one detail of the scenario is changed for a random subset of the respondents, allowing researchers to determine how that detail affects people’s answers.

From a survey length standpoint, vignettes are notable for showing respondents a block of text describing a particular scenario. They entail more reading than the other question types. Vignettes are rare on the ATP and are budgeted one point for every 50 words in length.

Vignettes are assessed based on the length of the scenario presented

Thermometer ratings

Feeling thermometers ask respondents to rate something on a scale from 0 to 100, in which 0 represents the coldest, most negative view and 100 represents the warmest and most positive.

Thermometer ratings count as 1.5 points

They are budgeted for 1.5 points because they are more difficult than a question with discrete answer options. It can sometimes be challenging for respondents to map their opinion of someone or something onto a 0-100 scale.
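
To make the arithmetic concrete, here is a minimal sketch of the kind of tally a researcher could run before fielding a draft questionnaire. The point values come from the rules described above; the draft questionnaire itself is made up, and the real budgeting process involves more judgment than a simple sum.

```r
# A minimal sketch of the pre-fielding length check described above.
point_values <- c(
  standalone   = 1,     # classic question with a set of answer choices
  battery_item = 0.67,  # each item in a battery
  open_short   = 5,     # straightforward open-ended question
  open_long    = 8,     # open-ended question asking for more detail
  check_all    = 2,     # "check all that apply"
  thermometer  = 1.5    # 0-100 feeling thermometer
)

# A made-up draft questionnaire: counts of each question type,
# plus one 150-word vignette (1 point per 50 words).
draft <- c(standalone = 40, battery_item = 20, open_short = 2,
           open_long = 1, check_all = 2, thermometer = 2)
vignette_words <- 150

total_points <- sum(draft * point_values[names(draft)]) + vignette_words / 50

total_points         # 40 + 13.4 + 10 + 8 + 4 + 3 + 3 = 81.4
total_points <= 85   # TRUE: the draft fits the 15-minute budget
```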

Conclusion

Thanks to these question counting rules, the incentives we offer for survey completion and the goodwill and patience of our panelists, about 80% or more of the respondents selected for each survey complete the questionnaire. In other words, the survey-level response rate is usually 80% or higher. Among the narrower set of people who log on to the survey and complete at least one item, about 99% typically complete the entire questionnaire. That is, the break-off rate is usually about 1%.

The post How we keep our online surveys from running too long appeared first on Decoded.

What we learned from creating a custom graphics package in R using ggplot2 https://www.pewresearch.org/decoded/2022/10/04/what-we-learned-from-creating-a-custom-graphics-package-in-r-using-ggplot2/ https://www.pewresearch.org/decoded/2022/10/04/what-we-learned-from-creating-a-custom-graphics-package-in-r-using-ggplot2/#respond Tue, 04 Oct 2022 19:48:00 +0000 https://www.pewresearch.org/decoded/?p=650 Building informative and digestible data visualizations is a foundational aspect of Pew Research Center’s work.

The post What we learned from creating a custom graphics package in R using ggplot2 appeared first on Decoded.

Pew Research Center illustration

Creating informative and digestible data visualizations is a foundational aspect of Pew Research Center’s work. Traditionally, our report graphics have been generated by individual researchers using a custom Excel template developed by our design team, or on an ad hoc basis by graphic designers using professional tools and mock-up designs.

In recent years, many of our researchers have expressed an interest in using R as an integrated environment for data analysis and graphics development. R allows users to generate iterative graphics directly from analysis scripts, which is helpful for quickly evaluating different ways to visualize results and findings.

However, the standard graphics libraries in R look quite different from Pew Research Center’s in-house style. As a result, our R users typically had two options: produce graphics directly in R that would need extensive reworking prior to publication or leave the R environment entirely in order to create stylistically appropriate visualizations.

To address this, some of our researchers started developing custom versions of Pew-style graphics using ggplot2, a complex yet highly customizable data visualization tool in the tidyverse suite for R. But with multiple researchers developing their own solutions — often in personal folders, and by copy-and-pasting from other projects — errors and inconsistencies accumulated.

In an effort to solve some of these issues, we set out to develop a Pew Research Center-specific custom R package containing all of the standard graphics functions we use in our work in one easy-to-install package — similar to what a number of other research and journalism organizations have done. Our main goal was to create a package that allows researchers to quickly iterate between different data visualizations that are consistent with Pew style.

Our package, which we refer to internally as “pewplots,” is a ggplot2 wrapper made up of customized versions of existing ggplot2 functions. In practice, this means that ggplot2 plotting functions are wrapped within pewplots functions, and code is added or modified to extend the functionality of the original plotting function.
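
To make the idea of a wrapper concrete, here is a stripped-down sketch of what wrapping a ggplot2 function can look like. The theme and function below are illustrative only: they are not the actual pewplots API, and the color and styling choices are placeholders rather than our official style.

```r
library(ggplot2)

# A hypothetical house theme layered on top of an existing ggplot2 theme
theme_house <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title = element_text(face = "bold"),
      legend.position = "top"
    )
}

# A hypothetical wrapper: a horizontal bar chart with the house styling baked in
house_barplot <- function(data, x, y, title = NULL) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_col(fill = "#436983") +   # placeholder color, not an official palette value
    coord_flip() +
    labs(title = title, x = NULL, y = NULL) +
    theme_house()
}

# Usage: house_barplot(country_shares, x = country, y = share, title = "Confidence in ...")
```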

In addition to ensuring that every R developer is using the same official style, we also hoped this package would make data visualization in R more user-friendly for those who are newer to programming or coding. Making wrappers for ggplot2 functions does require a fair amount of technical know-how. But once complete, wrapper functions like this take much of the complexity off of individual users’ plates and make plot generation standardized, faster and more accessible.

We developed and continue to manage the code with the help of GitHub and Nexus repositories, which allow us to work collaboratively and make the package directly accessible to any R user on our internal server.

But beyond the technical details, what are the pros and cons of an effort like this? From our experience, we have a few takeaways and tips if you want to try something like this yourself.

Things to consider before you begin, and tips for development

These are some decisions we made early in the process that helped keep us on track, as well as some more general tips for development. We’d recommend spending some time on these issues before you start — as it’ll help you save time and stress down the line — and keeping them in mind throughout the development process.

Determine if a custom graphics package is the right solution for you. The biggest question in this process is probably whether the time needed to put together a custom graphics package is worth it. Ask yourself why having custom plotting functions in a package would be useful to you or your organization.

If you’re primarily working with your data in R, it can be a hassle to switch between R and some other graphics generation program (Excel, Tableau, etc.). But if you’re able to make your graphics directly in R, there’s no need to worry about exporting data, making sure that data is always up to date, or switching between applications. Also, if there’s a specific style guide for your graphics that you’re trying to follow, existing ggplot2 graphics are unlikely to tick all the boxes you need. If this sounds like your circumstances, a custom package might be a good option for you.

A custom package might also be a good solution for you if you’re interested in automation. Having plots and plot features built into custom functions automates a lot of the nuance and detail involved in specifying plot aesthetics and alignment. Instead of copying and pasting the same 20 lines of code into your analysis 10 separate times, each with small adjustments, you can use functions to store most of the redundant information. It’s also useful if you have multiple people making graphics and you want to standardize those graphics across users.

Establish a baseline for what you want it to accomplish. Do you want only a theme, or do you want fully customized plot types? How much detail do you want to be default behavior, as opposed to the responsibility of the user?

One of the most important steps in the process is to figure out what features you need to include so your custom graphics package does what its users need and want it to do. At some point, it’s necessary to settle on a set of features you want to include to give yourself a goal, as well as a stopping point. There will probably always be more features you can add, but if you don’t determine the baseline of what you want to include, you might find yourself adding features but working toward no specific goal.

Decide if you want an internal or external release. The main question you want to ask here is whether you want to commit to code maintenance and development for external users. At Pew Research Center, we decided to keep our package for internal use only. It’s extremely beneficial to our workflows but isn’t necessarily something we’re interested in maintaining for external use, as we do with some of our other packages. That decision influenced how we designed our vignettes and other documentation, as well as the naming conventions used for many of the functions and colors/color palettes.

Get feedback from the people who will be using the package. While certain things may make sense to you as the developer, the most important thing is that the people that you’re making the package for are able to use it. User feedback is important for any product or service, and this is no different. Also, getting feedback from people who aren’t part of the development process can bring in new perspectives and help you catch things that you, as the developer, may not have noticed or considered. In our development process, it was especially important to get feedback from our design director to make sure that our graphics were up to par. Asking end users to test out the package allowed us to catch missing features and unclear documentation.

Test a lot. If you want your package to be flexible and versatile, you should test it on as many different data sources as possible. You might not even notice a feature is missing or broken until you’ve tested it on a specific data source or tried to make a specific kind of plot.

Challenges and pitfalls to watch for

Despite the benefits we’ve seen from creating our custom graphics package, our effort has not been without challenges.

Most significantly, we realized that an effort of this scale takes a lot of time and work. We developed v1.0.0 of pewplots over a period of around six months of consistent effort. There’s also a relatively high barrier to entry. Just to begin, you need a fairly high level of general R knowledge, as well as knowledge about the tidyverse and ggplot2 more specifically.

In our case, it was also necessary for developers and researchers to be familiar with the ins and outs of Pew Research Center’s official style guide for graphics. In addition, the process required a decent understanding of the graphics that different teams within the Center make, including the different features and customizations those teams require.

All of that is to say that there is a gap between generating R graphics that complement our internal analysis and workflows and generating R graphics that are publication-ready. There are also still countless features we could add in terms of tweaks and customizations. One key consideration in the future evolution of our internal graphics package is distinguishing between features that we need to have — as opposed to ones that are nice to have — and generally preventing “scope creep” as we move forward.

The post What we learned from creating a custom graphics package in R using ggplot2 appeared first on Decoded.
