Data-Science-Final-Project

Perrin’s Data Science Final Project

Idea 1: Gender gaps across school subjects at different levels of education over time.

Questions:

Does the level of education affect the rate at which gender gaps close?
Does the level of education affect when gender gaps start to close? For example, it is plausible that a gender gap in computer science could start to close at the college level a few years after it starts to close at the high school level, as the original cohort of high schoolers enters college.
How do gender gaps across different school subjects vary by region? Does this change over time?
Potential Challenges:
Gender is a spectrum, but I will be shocked if I can find data about gender gaps that acknowledges this. Older data will probably record biological sex, and newer data will probably record gender as man, woman, and other. Depending on who collected the data, the “other” category could contain only nonbinary and genderfluid people, or it could also contain trans people. Should I analyze the extremely broad and undescriptive “other” category as its own gender?
Lots of subjects branch into more sub-subjects as education level progresses. Math in middle school can become math, computer science, and physics in high school. Computer science in high school can become computer science, computer engineering, data science, cybersecurity, etc. in college. How can I account for this in my analysis? How can I even track which subjects branch in what ways? Should I focus mostly on named college majors that match named classes in high school?
Idea 2: (Probably) unconscious patterns in fiction writing

Questions:
How consistent are authors about the order in which they list their characters’ names?
For the authors who are consistent, is there any correlation between this order and the importance, gender, race, or age of the characters?
How do the demographics of authors correlate with the demographics of major characters with significant amounts of dialogue?
Potential Challenges:
This project may involve a lot of manual data collection from text files. Some things, like the order of name lists and the amount of dialogue attributed to a character, I might be able to scrape from text files myself. However, figuring out the demographics and importance of a character would be hard. I might be able to use an AI model for this, but I would probably have to pay for API access. In general, unless I can find a really good dataset for this, I might struggle to build one that is big enough myself.
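As a rough illustration of the part that could be scraped directly, here is a minimal sketch of pulling name lists out of a text and recording the order of the names. It assumes the set of character names is already known, and it only recognizes lists joined by commas and "and". The function name `find_name_lists` and the sample sentence are made up for this example.

```python
import re

def find_name_lists(text, names):
    """Return each run of known names joined by commas or 'and', in written order.

    `names` is an assumed, pre-collected set of character names.
    """
    # Longest names first so a longer name wins over a shorter one sharing its start.
    name_pat = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    # One name followed by one or more "separator + name" pairs,
    # where a separator is ", and ", ", ", or " and ".
    list_pat = re.compile(
        rf"\b(?:{name_pat})\b(?:(?:,\s*and\s+|,\s*|\s+and\s+)\b(?:{name_pat})\b)+"
    )
    single = re.compile(rf"\b(?:{name_pat})\b")
    return [single.findall(m.group(0)) for m in list_pat.finditer(text)]

# Made-up sample text, not taken from any real book.
sample = "Ana, Ben, and Cleo left early. Later, Cleo and Ana argued."
orders = find_name_lists(sample, {"Ana", "Ben", "Cleo"})
print(orders)
```

Counting how often each ordering appears per author would then be a matter of tallying these lists. Nicknames, titles, and pronouns are not handled here and would need manual work.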
Idea 3: According to Our World in Data, the price of lighting in the UK has fallen drastically since the 1300s. I want to explore possible causes and effects of this.

Potential Questions
What correlations exist between the cost of light and levels of education?
What correlations exist between the cost of light and innovation?
What correlations exist between the cost of light and GDP?
Week 10 Update:

This will be a solo project on gender gaps in classes and majors at Whitman, and how they are related to race. If the College Board responds to my data request, then I also hope to discuss how gender gaps at Whitman are related to gender gaps in high school AP classes.

I hope to use the following Data Sources:

Data Source 1: Whitman’s Institutional Research

I have contacted Neal Christopherson to ask for data, but I have not heard back yet. As a result, my Pros/Cons list is speculative.

Pros

I expect that the data will be quite complete and require minimal cleaning.
The data was collected by a group of people I can contact easily. So, I will be able to answer questions about who created the data and why.
I will be able to clarify any parts of the dataset with confusing documentation.
Cons
The data will probably have rigid categories.
Since I know people at Whitman personally, I will need to ensure that the data I ask for would not allow me to guess which person a record describes.
Using data specifically about Whitman students means that I need to narrow the scope of my project to focus on Whitman alone. I won’t be able to draw broader conclusions.
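One common way to reduce the re-identification risk mentioned above is small-cell suppression: hiding any group count below some minimum before looking at or sharing the data. The sketch below is only an illustration. The threshold of 5, the function name `suppress_small_cells`, and the example counts are all assumptions, not anything provided by Institutional Research.

```python
def suppress_small_cells(counts, threshold=5):
    """Replace any count below `threshold` with None so tiny groups are hidden."""
    return {key: (n if n >= threshold else None) for key, n in counts.items()}

# Made-up counts keyed by (department, gender).
counts = {("Physics", "woman"): 2, ("Physics", "man"): 14}
safe = suppress_small_cells(counts)
print(safe)
```

Whether suppression happens before the data is shared or only in what gets published is a question for whoever provides the data.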
Data Source 2: College Board

I have submitted a request to the College Board for data about the demographics of AP classes. I don’t know whether they will allow me to access this data. I will honestly be kind of surprised if they do.

Pros
This would be a massive dataset that would cover the entire nation. It would allow me to draw broader conclusions.
I expect that this dataset would be clean and complete.
Cons
AP classes are more common in wealthier schools, and the students taking AP classes are often of higher socioeconomic status. As a result, my data would disproportionately represent wealthier students.
I might not get this data from the College Board.
The data will probably have rigid categories.
Research Questions (That I could answer without College Board data)
How have gender gaps at Whitman changed over time? Which majors/subjects have the largest gender gaps? Has this always been the case?
How does intersectionality affect gender gaps? Do gender gaps close for white people before they close for everyone else?
How do gender gaps in classes within a department compare to the gender gaps in its majors?
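To make "gender gap" concrete for the first question, here is a minimal sketch of one possible measure: (men - women) / (men + women) for each major and year, which is 0 at parity and moves toward 1 or -1 as one group dominates. The field names (`year`, `major`, `gender`, `count`), the category labels, and the numbers are hypothetical, since the real dataset has not arrived yet. Rows with any other gender label are skipped here, which is exactly the open question raised under Potential Challenges and not a recommendation.

```python
from collections import defaultdict

def gender_gap_by_year(rows):
    """Return {(major, year): gap}, where gap = (men - women) / (men + women)."""
    totals = defaultdict(lambda: {"man": 0, "woman": 0})
    for row in rows:
        # Assumption: only these two labels are compared; other labels are skipped.
        if row["gender"] in ("man", "woman"):
            totals[(row["major"], row["year"])][row["gender"]] += row["count"]
    gaps = {}
    for key, c in totals.items():
        n = c["man"] + c["woman"]
        if n > 0:  # avoid dividing by zero for empty groups
            gaps[key] = (c["man"] - c["woman"]) / n
    return gaps

# Made-up enrollment counts for illustration only.
rows = [
    {"year": 2010, "major": "Computer Science", "gender": "man", "count": 30},
    {"year": 2010, "major": "Computer Science", "gender": "woman", "count": 10},
    {"year": 2020, "major": "Computer Science", "gender": "man", "count": 30},
    {"year": 2020, "major": "Computer Science", "gender": "woman", "count": 20},
]
gaps = gender_gap_by_year(rows)
for (major, year), gap in sorted(gaps.items()):
    print(major, year, round(gap, 2))
```

Adding a `race` field to the grouping key would give the breakdown needed for the intersectionality question, and running the same function on class enrollments and on declared majors would allow the class-versus-major comparison.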
