This is the Beginning
RuPaul’s Drag Race is my favorite show of all time, and as a data analyst it’s always been my dream to use data about the show and its queens to make observations about the contestants and predictions about their performance and casting. I knew this had to start with the data: making it clean and accessible is the first step to enabling this superfan obsession. This project has been stewing for a long time and has seen many private iterations and prototypes. My perfectionism, though, has prevented me from showing anything at all. I’ve realized the best way to start is just to start, even if it isn’t perfect. So here I am, starting a database of drag queens. The focus for now is contestants of RuPaul’s Drag Race, but ultimately I would like to make it a global repository for data on all drag queens, hence the name…
D R A G N E T
So why did I wait this long? Because it’s not easy! For a long time I thought this would be a simple project. I’m not even talking about the analytics I want to do on the data, just collecting it! But as I have tried and struggled to do this, I’ve run into barriers I never expected. Working on this has at times tested the limits of my data munging abilities, like knowing how to transform features, and has really exposed gaps in my knowledge about data governance and database creation. One of the really big questions I’m still working through is how to fairly quantify the success of contestants on one of the world’s most beloved shows. But I’m not waiting any longer, so here is blog post one, and I’m starting with the meat and potatoes: the queens.
But First, Why?
Why do this? What does Drag Race mean to me, and why is it so important? RuPaul’s Drag Race is my favorite show. There is a personal element to this, and a broader context in which I think it’s both fascinating and extremely important. When I was 20, I had only been out of the closet for a year when my roommate (and close friend) at the time asked me one evening if I wanted to watch RuPaul’s Drag Race. I remember knowing about Drag Race but always thinking “I’m not that gay.” My internalized homophobia was a toxic tenet of my new life as an openly gay man. Thankfully, because my friend suggested it, I started watching the show and quickly became totally engrossed.
Drag Race is a show that embraces and celebrates effeminate gay men. It is now broadening to the larger queer community, including our transgender family, and hopefully it will grow to include the rest of the community. (I’m thinking of a show that’s an inclusive space for queer women too. Maybe we can get a DRAG KING or BIO QUEEN on the show, y'all???) For me it was the first time I ever got to see the type of queer person society taught me to hate being celebrated and elevated.
I grew up in Fayetteville, North Carolina. It’s not just a small conservative city in the South, but also home to one of the largest military bases in North America. Military culture can be very conservative and perpetuates traditional, toxic ideals of masculinity. Drag Race was revolutionary to me. It showed me that I did not have to hate the parts of myself I had learned to destroy. It taught me how to use them as a source of strength, empowerment, and freedom.
The second really important thing about Drag Race also has to do with places like Fayetteville. I went to college in Seattle, Washington, and I lived in Capitol Hill, the “gayborhood” of the city, if you will. In places like these, LGBTQ+ culture is accessible and all around you. But if you’ve lived in a place like Fayetteville, then you know what it’s like to live in a desert, with only small pockets of water around you that are never enough. Drag Race is changing that, making queer culture accessible anywhere. For many, it is all they have access to. More than that, it is the foremost vehicle projecting queer visibility into the mainstream, and a major contributor to how queer people are represented there. Through this lens, it is fascinating to watch the show. What do past seasons say about where we’ve been, and what does the show today say about where we are? I could go on at length about this, but hopefully you get the idea. Drag Race is important.
Queen DATA
But enough of my prattle, let’s make this data! I’m going to start with the basic descriptive dataset of all the queens who’ve appeared on the show. This is the foundation of what will become the database, and the most versatile table in it. My hope is to expand from Drag Race to more broadly record drag queens with active careers and other shows, and to incorporate new data like social media as drag and the platforms we consume it on continue to evolve. For now, though, we’ll get the basic descriptive information on the RPDR contestants.
First, let’s store the links that we’ll use to SNATCH that data in a vector. I chose Wikipedia as my source to scrape data from. Believe it or not, sources of data on Drag Race are scarce on the internet; the only other website I found containing similarly comprehensive information on the show was this RPDR wiki. I looked for sources that had the information in tables that would be easy to scrape.
seasons = c("https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_1)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_2)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_3)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_4)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_5)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_6)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_7)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_8)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_9)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_(season_10)")
as = c("https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_All_Stars_(season_1)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_All_Stars_(season_2)",
"https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race_All_Stars_(season_3)")
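Since the links differ only in the season number, the same two vectors could also be generated instead of typed out. This is just a sketch: the names `base`, `seasons_built`, and `as_built` are made up for illustration, and it assumes Wikipedia keeps this URL pattern for every season.

```r
# Base URL shared by every season page (copied from the vectors above).
base <- "https://en.wikipedia.org/wiki/RuPaul%27s_Drag_Race"

# Regular seasons 1-10 and All Stars seasons 1-3.
# paste0 is vectorised, so each call returns one URL per season number.
seasons_built <- paste0(base, "_(season_", 1:10, ")")
as_built <- paste0(base, "_All_Stars_(season_", 1:3, ")")

print(seasons_built[1])
print(as_built[3])
```

I used `paste0` rather than `sprintf` here because the `%27` in the URL would otherwise need escaping as a format specifier.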
The next thing was to loop through each season, take the data and append it to a base table. You can see that I do this for each regular season and All Stars season. I also create a counter to keep track of which season the cast comes from. We’ll take a look at the first few rows from the main queens table that this creates.
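library(rvest)
library(plyr)   # load plyr before dplyr so it does not mask dplyr's functions
library(dplyr)
library(tidyr)

# regular seasons: scrape the contestant table from each page and stack the results
count = 1
queens = data.frame()
for(i in seasons){
  bin = i %>% read_html() %>%
    html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>% html_table()
  bin = bin[[1]]
  bin$Origin.Season = count   # record which season this cast comes from
  queens <- rbind(queens,bin)
  count = count + 1
}

# All Stars seasons: same approach, but rbind.fill fills columns missing from a season with NA
count = 1
as.queens <- data.frame()
for(i in as){
  bin = i %>% read_html() %>%
    html_nodes(xpath='//*[@id="mw-content-text"]/div/table[2]') %>% html_table()
  bin = bin[[1]]
  bin$AS.Season = count       # record which All Stars season this cast comes from
  as.queens = rbind.fill(as.queens,bin)
  count = count + 1
}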
options(width=200)
head(queens)
## Contestant Name Age Hometown Outcome Origin.Season
## 1 BeBe Zahara Benet Nea Marshall Kudi Ngwa[3] 28 Minneapolis, Minnesota Winner 1
## 2 Nina Flowers Jorge Luis Flores Sanchez[4] 34 Bayamón, Puerto Rico Runner-up 1
## 3 Rebecca Glasscock Javier Rivera[5] 26 Fort Lauderdale, Florida 3rd Place 1
## 4 Shannel Bryan Watkins[6] 29 Las Vegas, Nevada 4th Place 1
## 5 Ongina Ryan Ong Palao[7] 26 Los Angeles, California 5th Place 1
## 6 Jade[a] David Sotomayor[8] 25 Chicago, Illinois 6th Place 1
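The two loops above differ only in the vector of links and the name of the season column, so they could be folded into one helper. Here is a rough sketch of that idea, not what I actually ran: `scrape_cast` and its arguments are hypothetical names, and the demo parses a tiny hand-written HTML string (shaped like the part of the page the XPath targets) so it runs without hitting Wikipedia.

```r
library(rvest)

# Hypothetical helper: read one page, pull the table the XPath points at,
# and tag every row with the season number.
scrape_cast <- function(page, season, season_col,
                        xpath = '//*[@id="mw-content-text"]/div/table[2]') {
  tables <- read_html(page) %>% html_nodes(xpath = xpath) %>% html_table()
  if (length(tables) == 0) stop("no table matched for season ", season)
  cast <- as.data.frame(tables[[1]])
  cast[[season_col]] <- season
  cast
}

# Offline demo: two tables inside the content div; the XPath picks the second.
demo_html <- paste0(
  '<div id="mw-content-text"><div>',
  '<table><tr><th>Infobox</th></tr><tr><td>ignored</td></tr></table>',
  '<table><tr><th>Contestant</th><th>Age</th></tr>',
  '<tr><td>Shannel</td><td>29</td></tr></table>',
  '</div></div>'
)
demo <- scrape_cast(demo_html, season = 1, season_col = "Origin.Season")
print(demo)

# With network access, the loops might then reduce to something like this (untested):
# queens    <- do.call(rbind, unname(Map(scrape_cast, seasons, seq_along(seasons), "Origin.Season")))
# as.queens <- plyr::rbind.fill(unname(Map(scrape_cast, as, seq_along(as), "AS.Season")))
```

The explicit check for an empty result is there because a fixed XPath like `table[2]` can silently stop matching if the page layout changes.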
As the first rows of queens show, we extract the basics, but the data has issues. Some of the fields have numbers and letters in brackets (footnote markers carried over from the Wikipedia tables), and there is no separation between city and state/country in the hometown column. Let’s tidy this up a bit and then look at what we have.
# making seasons factors
queens$Origin.Season = as.factor(queens$Origin.Season)
as.queens$AS.Season = as.factor(as.queens$AS.Season)
# separating hometown into a city and state / country column
# (the separator also consumes the space after the comma, so State.Country has no leading blank)
queens = separate(queens, Hometown, into=c("City","State.Country"),sep=",\\s*",remove=TRUE)
as.queens = separate(as.queens, Hometown, into=c("City","State.Country"),sep=",\\s*",remove=TRUE)
# dropping the original-season columns from the All Stars table
as.queens = select(as.queens, -`Original placement`,-`Original season`)
# making names and real names into one column
as.queens$Name = ifelse(is.na(as.queens$Name),as.queens$`Real name`,as.queens$Name)
as.queens$`Real name` = NULL
# filtering out the bracketed footnote markers in the text columns
queens$Contestant = gsub("\\[(.*?)\\]","",queens$Contestant)
queens$Name = gsub("\\[(.*?)\\]","",queens$Name)
queens$Outcome = gsub("\\[(.*?)\\]","",queens$Outcome)
as.queens$Contestant = gsub("\\[(.*?)\\]","",as.queens$Contestant)
as.queens$Name = gsub("\\[(.*?)\\]","",as.queens$Name)
as.queens$Outcome = gsub("\\[(.*?)\\]","",as.queens$Outcome)
# reordering the columns so the season comes first
queens = queens[c(7,1,2,3,4,5,6)]
as.queens = as.queens[c(7,1,2,3,4,5,6)]
options(width=200)
head(queens)
## Origin.Season Contestant Name Age City State.Country Outcome
## 1 1 BeBe Zahara Benet Nea Marshall Kudi Ngwa 28 Minneapolis Minnesota Winner
## 2 1 Nina Flowers Jorge Luis Flores Sanchez 34 Bayamón Puerto Rico Runner-up
## 3 1 Rebecca Glasscock Javier Rivera 26 Fort Lauderdale Florida 3rd Place
## 4 1 Shannel Bryan Watkins 29 Las Vegas Nevada 4th Place
## 5 1 Ongina Ryan Ong Palao 26 Los Angeles California 5th Place
## 6 1 Jade David Sotomayor 25 Chicago Illinois 6th Place
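The bracket-stripping step is repeated once per column above. As a sketch of an alternative, the same cleanup could be applied to every text column in one pass. The example uses base R only and runs on two rows copied from the raw output earlier in this post; `strip_refs` and `raw` are hypothetical names, and the split assumes every hometown contains a comma.

```r
# Toy rows copied from the raw head(queens) output.
raw <- data.frame(
  Contestant = c("BeBe Zahara Benet", "Jade[a]"),
  Name = c("Nea Marshall Kudi Ngwa[3]", "David Sotomayor[8]"),
  Hometown = c("Minneapolis, Minnesota", "Chicago, Illinois"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove bracketed footnote markers such as [3] or [a].
strip_refs <- function(x) trimws(gsub("\\[[^]]*\\]", "", x))

# Apply it to every character column at once.
is_text <- vapply(raw, is.character, logical(1))
raw[is_text] <- lapply(raw[is_text], strip_refs)

# Split hometown at the first comma and trim the space that follows it.
raw$City <- sub(",.*$", "", raw$Hometown)
raw$State.Country <- trimws(sub("^[^,]*,", "", raw$Hometown))
raw$Hometown <- NULL

print(raw)
```

Selecting columns by type rather than by name means a new text column would get cleaned without another line of code, at the cost of also touching columns that might legitimately contain brackets.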
So now we have a basic descriptive dataset on queens who’ve been contestants on RuPaul’s Drag Race. Like I said, eventually I would like to expand this dataset to include queens from other shows like Dragula and, more generally, working queens. Additionally, either in this dataset or another, it will make sense to include social media handles and follower counts, as these are becoming a relevant measure of a queen’s success.
Next Steps
The next two datasets I want to create are:
- A dataset of all the lip syncs that have occurred on the show
- A dataset that tracks the progress and success in the competition of each queen on the show.
The second dataset is more complicated because the show does not adhere to a strict format. Queens leave and re-enter the competition, All Stars seasons exist, etc. Is there one fair, equitable way to quantify the performance of queens on the show? How does social media clout factor into this, especially if we ever want to study post-show outcomes? We’ll grapple with this next. Until next time, forever and always,
MAX WARBURG