
SQL Project

In my Development Tools for Data Analysis class (OIDD 105), my teammates and I analyzed the content of TED Talks to compare talks given by men and those given by women.

Our dataset consisted of TED Talks posted on www.ted.com/talks between 2015 and 2018, about 1,500 videos in total.

WebScraper and Excel

We used WebScraper to scrape each TED Talk’s title, video link, speaker name, speaker description, talk description, tags, and transcript. To make sure we scraped all the tags (some were hidden behind a 3-dot button), we used an element click selector.

We then cleaned the data in Excel and labeled each speaker with “M” or “F” using Excel formulas that checked whether gendered pronouns appeared in the speaker descriptions; this wasn’t comprehensive, so we labeled the rest manually.
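As a rough illustration of that check (sketched here in SQL rather than as the actual Excel formula, with made-up table and column names), the logic was essentially:

    SELECT speaker_name,
           CASE
               WHEN speaker_description LIKE '% she %'
                 OR speaker_description LIKE '% her %' THEN 'F'
               WHEN speaker_description LIKE '% he %'
                 OR speaker_description LIKE '% his %' THEN 'M'
               ELSE ''   -- no pronoun match; label manually
           END AS gender
    FROM talks;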

The tags of each talk indicated its topics. We wanted to use tags with a high number of occurrences (tags that appeared in at least 150 talks), so that there would be a large enough “sample size” to compare men and women. We also classified tags into broader topic groups such as science, social, business, etc. The full list of tags/tag groups is here.
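Expressed as a query (a sketch only, assuming the scraped tags had been split into one row per talk/tag pair in a hypothetical talk_tags table), the frequency cutoff amounts to:

    SELECT tag, COUNT(*) AS talk_count
    FROM talk_tags              -- hypothetical table: one row per (talk, tag) pair
    GROUP BY tag
    HAVING COUNT(*) >= 150      -- keep only tags used in at least 150 talks
    ORDER BY talk_count DESC;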

We also had to manually enter video tags that, for some reason, were not scraped by WebScraper. In addition, we excluded instructional videos without a main speaker, videos in other languages, and videos with performance groups (a mix of men and women). This left our total dataset at 1,300+ videos.

SQL & a bit more Excel

Since we split up the work of manual data entry for both tags and genders, we joined each of our tables together in SQL to get the complete dataset.
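A minimal sketch of that kind of join (the table and column names here are placeholders, not our actual schema), keyed on the video link:

    SELECT t.*, g.gender, m.manual_tags
    FROM talks AS t
    JOIN speaker_genders AS g ON g.video_link = t.video_link
    JOIN manual_tags     AS m ON m.video_link = t.video_link;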

Afterwards, in SQL, we added columns for each of our tags/tag groups (32 tag/tag group columns in total). To populate each column, we used a case statement: if a row (i.e., a talk) contained a tag corresponding to the tag column, we coded it as 1.00 (i.e., this talk covers this topic); otherwise we coded it as 0. For example, if a talk was tagged with “society,” “social change,” “culture,” etc., and we were populating the “social” tag group column, the value for that talk would be 1.00. See here for the queries we executed.
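As an illustration of the pattern (our actual queries are linked above; the tag list below is abbreviated and the column names are placeholders), populating the “social” column looked roughly like:

    ALTER TABLE talks ADD COLUMN social DECIMAL(3,2);

    UPDATE talks
    SET social = CASE
        WHEN tags LIKE '%society%'
          OR tags LIKE '%social change%'
          OR tags LIKE '%culture%' THEN 1.00   -- talk covers this topic group
        ELSE 0                                 -- talk does not
    END;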

We then transferred the data back to Excel as a CSV, where we counted the number of videos per tag and found the proportion of men vs. women speakers for each tag/tag group. We then performed a z-test for proportions (using p < 0.05 as the significance threshold), comparing each observed proportion with the expected proportion of 0.5, which is what it would be if gender played no role in what topics TED Talks cover.
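Concretely, for a tag/tag group with n talks and an observed share of male (or female) speakers \hat{p}, the standard one-proportion z-statistic is

    z = \frac{\hat{p} - 0.5}{\sqrt{0.5\,(1 - 0.5)/n}}

and we flagged a tag as significantly skewed toward one gender when the corresponding p-value fell below 0.05.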

Here’s what we found:

Transcript Analysis?

We had originally also wanted to perform NLP analysis on the transcripts. However, we realized it would be challenging for us to classify particular words as markers of particular characteristics (e.g., statistical vs. emotional language); we also didn’t have NLP backgrounds.

Instead, we did a simpler analysis comparing men’s and women’s usage of “I”/“me” vs. “we”/“us” (including variations such as “I’ll”, “we’ll”, “I’m”, “we’re”, “we’d”, etc.). We found that the ratio of “I”/“me” to “we”/“us” is about 1.31:1 for men and about 1.68:1 for women. This finding was very interesting!
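Assuming the per-talk pronoun counts had already been extracted into columns (the counting itself happened outside SQL, and the column names below are hypothetical), the comparison reduces to a grouped ratio:

    SELECT gender,
           SUM(i_me_count) * 1.0 / SUM(we_us_count) AS i_to_we_ratio
    FROM talks
    GROUP BY gender;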

Potential limitations to our analysis:

We realize there might be some biases/limitations to our analysis because…

- our dataset includes only English-speaking TED speakers and non-TED-Ed videos featured on the TED website
- the tagging decisions of the TED website developers affected our analysis - some videos had no tags, while others had 10+ tags
- there could also be some personal bias in the way we categorized certain tags in our tag groups
- in our transcript analysis, we did not account for videos that did not have a transcript
- the total dataset was 1300+ videos but the number of videos per tag/tag group was relatively small (usually <300)

Overall, this was a fun project to apply some of the skills I learned in class!