Social Media and Climbing

Interest in expeditions, especially to the coveted summit of Everest, was slow to build over the decades, but something changed in the late 1990s. While no direct correlation is being drawn here, social media platforms also emerged around this period and truly took off in the 2000s.

Number of Expeditions per Year on Mount Everest. We see a sharp increase in the number of expeditions per year after the start of commercial expeditions in 1988.

Reddit was founded on June 23, 2005, and close on its heels, Twitter users began posting in 2006. Besides the Everest subreddit, Reddit has two popular trekking-related subreddits: Mountaineering and Alpinism. From Reddit posts we will derive topics (intent discovery) and topic trends.

On Twitter we find veteran mountaineers who have summited Himalayan peaks multiple times, such as Alan Arnette (@alan_arnette) and Kenton Cool (@KentonCool), who coach and motivate aspirants. In tweets, we look for the motivating factors that drive people to climb Everest and for the perceptions people have of the trek up the Himalayan mountains.

To understand how social media connects with climbing, we extracted 117,093 tweets posted from 2006 to 2023 using Snscrape and the Twitter API. The extracted data shows a mix of topics. Those relevant to our interest carry tones of pride.

However, metaphors and idioms create noise. In common language, Everest is used as a synonym for the “unattainable” or the “very massive”, and it also appears in entertainment-related terms such as the National Geographic series on Everest and the Disney ride "Expedition Everest". Our first challenge is to separate tweets about the Himalayan peak, which capture Twitter users' interest in the climb, from the rest. Examples of such noise:

God I can't go out with Mount Everest on my cheek
Being sick is when crawling out of bed to walk 12 feet to the bathroom feels like climbing Mount Everest. Bed: warm. Everywhere else: cold.
Its like having a picture of mount Everest in ur mind and when u finally get to see it,its like olumo rock. Buh then again what do I know?
How in the world is there an ant on a roof . That's gotta be the equivalent to me climbing Mount Everest

We must first reduce this data set by removing tweets with obvious references to the entertainment domain.
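
A first-pass filter of this kind can be sketched as below. This is a minimal illustration, not the project's actual filter: the keyword list, the function names, and the two sample tweets are assumptions made up for the example.

```python
import re

# Hypothetical keyword list; the terms actually used in the project are not stated.
ENTERTAINMENT_PATTERNS = [
    r"expedition everest",    # the Disney ride named in the text
    r"\bdisney\b",
    r"animal kingdom",
    r"national geographic",
]
_ENTERTAINMENT_RE = re.compile("|".join(ENTERTAINMENT_PATTERNS), re.IGNORECASE)


def is_entertainment(tweet: str) -> bool:
    """Return True when the tweet matches any entertainment keyword."""
    return _ENTERTAINMENT_RE.search(tweet) is not None


def drop_entertainment(tweets):
    """Keep only tweets with no obvious entertainment reference."""
    return [t for t in tweets if not is_entertainment(t)]


# Made-up sample tweets, for illustration only.
sample = [
    "Rode Expedition Everest twice at Disney today!",
    "Reached Everest base camp after nine days of trekking.",
]
kept = drop_entertainment(sample)
```

A keyword filter like this only catches explicit entertainment references. Figurative uses of Everest, such as the examples above, pass through it and are left to the topic model.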

Short Text Topic Modeling

The Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM) is an “altered” LDA algorithm that shows strong results on short-text topic modeling tasks. It makes one initial assumption: one topic describes one document. All words within a document are generated from the same single topic, not from a mixture of topics. By contrast, Latent Dirichlet Allocation (LDA) uses a bag-of-words generative approach that draws from a distribution of topics in each document and a distribution of words in each topic. LDA is suited to long documents, from which frequently appearing words and their context can be inferred. In short text, two sentences with similar words can have very different meanings, and the context depends on the proximity of words. LDA does not work as well here because most words occur only once in each short text. For the same reason, frequency-based measures such as Term Frequency-Inverse Document Frequency (TF-IDF) carry little signal. For instance, in a short sentence with no repeated words, every term frequency is 1, so term frequency cannot distinguish one word from another.
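
The term-frequency point can be checked on one of the noisy tweets quoted earlier. The snippet is only an illustration with whitespace tokenization, which is an assumption and not the project's preprocessing.

```python
from collections import Counter

# One of the example tweets from this dataset, truncated to its first sentence.
tweet = "How in the world is there an ant on a roof"

# Naive whitespace tokenization (an assumption made for this illustration).
counts = Counter(tweet.lower().split())

# Every token occurs exactly once, so raw term frequency ranks no word above another.
all_ones = all(c == 1 for c in counts.values())
```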

A simple way to explain GSDMM is to imagine a classroom with students seated randomly at K tables. Each student is asked to write a short list of favorite movies on paper. The objective is to cluster the students so that those within the same group share the same movie interests. To do so, as the professor calls each student’s name, the student must choose a new table according to the following two rules:

Choose a table with more students. This improves completeness: all students sharing the same movie interests are assigned to the same table.
Choose a table where students share similar movie interests. This rule increases homogeneity: only members sharing the same movie interests sit at a table. (Pelgrim, n.d.)
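
The two rules can be sketched as a small collapsed Gibbs sampler. This is a simplified, standard-library illustration of the table-choosing step, not the library used in the project; the function name `gsdmm`, its defaults, and the toy documents are assumptions.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=5, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Cluster tokenized short documents; returns one cluster label per document."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    m = [0] * K                                # documents ("students") per table
    n = [0] * K                                # total words per table
    nw = [Counter() for _ in range(K)]         # word counts per table
    labels = []

    # Initialization: seat every document at a random table.
    for doc in docs:
        z = rng.randrange(K)
        labels.append(z)
        m[z] += 1
        n[z] += len(doc)
        nw[z].update(doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            # The document leaves its current table.
            z = labels[d]
            m[z] -= 1
            n[z] -= len(doc)
            nw[z].subtract(doc)

            # Score each table in log space to avoid underflow.
            log_p = []
            for k in range(K):
                lp = math.log(m[k] + alpha)    # rule 1: prefer fuller tables
                seen = Counter()
                for i, w in enumerate(doc):
                    # rule 2: prefer tables that already contain this word
                    lp += math.log(nw[k][w] + beta + seen[w])
                    lp -= math.log(n[k] + V * beta + i)
                    seen[w] += 1
                log_p.append(lp)

            # Sample a new table in proportion to the scores.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z = rng.choices(range(K), weights=weights)[0]
            labels[d] = z
            m[z] += 1
            n[z] += len(doc)
            nw[z].update(doc)
    return labels


# Made-up toy documents, for illustration only.
docs = [
    ["base_camp", "summit", "nepal"],
    ["summit", "sherpa", "base_camp"],
    ["animal_kingdom", "ride", "disney"],
    ["disney", "ride", "flight_passage"],
]
labels = gsdmm(docs, K=3, alpha=0.1, beta=0.5, seed=42)
```

K is only an upper bound on the number of clusters: tables can empty out as documents move, which is why the number of topics is chosen after inspecting how the model converges. Because the assignment is sampled, results vary with the seed.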

The tweet data is cleaned with Gensim and lemmatized with spaCy, and a GSDMM model is then fit to it. We consider the following parameters: the number of clusters K, Alpha, and Beta.

Alpha and Beta work in opposite directions in cluster convergence and were varied experimentally around the default value of 0.1. Raising Beta from its default of 0.1 to 0.5 revealed more human-understandable topics for this dataset. The number of clusters (five topics) was selected based on the convergence of GSDMM and on the semantic similarity of the words in each cluster for the various values of K. The generated model was then applied to the entire dataset. The downstream task is to visualize common Everest expedition words against the rest, so we needed to separate Everest from the other topics. We therefore refined the topics further using hashtags that reveal the topic and merged the non-Everest topics, resulting in two categories: Everest and non-Everest (which we will call Entertainment).

Topics 0, 1, and 3 have a higher frequency of Everest-related words such as base camp, summit, secure gear, nepal, southwest face, reach summit, and hike.
Topics 2 and 4 are Disney and entertainment related, with words such as 'animal_kingdom', 'flight_passage', 'donald_duck', 'time', 'pineapple_dolewhip', 'spend_money', 'tiana_expedition', 'jasmine_expedition'.
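
The merge into two categories can be sketched as follows. The topic grouping comes from the description above, but the hashtag sets, the function name `categorize`, and the choice to let a hashtag override the topic label are assumptions made for illustration.

```python
import re

# Topic ids grouped as described above; the remaining topics count as Entertainment.
EVEREST_TOPICS = {0, 1, 3}

# Hypothetical hashtag lists; the tags actually used in the project are not stated.
EVEREST_TAGS = {"#ebctrek", "#everestbasecamptrek"}
ENTERTAINMENT_TAGS = {"#expeditioneverest", "#animalkingdom"}


def categorize(topic_id: int, text: str) -> str:
    """Return 'Everest' or 'Entertainment' for one tweet."""
    tags = set(re.findall(r"#\w+", text.lower()))
    # Assumption: a revealing hashtag takes precedence over the topic label.
    if tags & ENTERTAINMENT_TAGS:
        return "Entertainment"
    if tags & EVEREST_TAGS:
        return "Everest"
    return "Everest" if topic_id in EVEREST_TOPICS else "Entertainment"
```

For example, `categorize(2, "Day 3 of the #EBCtrek")` returns "Everest" even though topic 2 is an entertainment topic, because the hashtag reveals the subject.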

The hyperparameters chosen are therefore K=5, alpha=0.1, and beta=0.5. The tokens in the cleaned, lemmatized text are then fit to a ScatterText two-category visualization model. ScatterText uses the scaled F-score, which takes into account category-specific precision and term frequency (Kessler, 2016/2023). While a term may appear frequently in both categories (Everest and Entertainment), the scaled F-score determines whether the term is more "characteristic" of one category than the other. This results in the following visualization.

term	Everest frequency	Entertainment frequency	Everest F-score	Entertainment F-score
mountain_conquer	3836	19	1.00	0.00
sir_edmund	2163	99	1.00	0.00
edmund_hillary	4107	82	1.00	0.00
hillary	2184	104	0.99	0.01
base_camp	1547	208	0.98	0.02
climb	4722	1383	0.96	0.04
high	1116	187	0.94	0.06
quote_day	948	13	0.94	0.06
people	1207	557	0.91	0.09
world	1013	387	0.90	0.10

Top-ten most frequent Everest words and their F-scores
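
The idea behind the scaled F-score can be sketched with the standard library. This is a simplified, one-directional approximation, not ScatterText's implementation: precision and frequency are each passed through a normal CDF and then combined with a harmonic mean. The function name is an assumption, and because only five terms from the table are used, the values will not match the table.

```python
from statistics import NormalDist, mean, pstdev


def scaled_f_scores(cat_counts, other_counts, beta=1.0):
    """Harmonic mean of normal-CDF-scaled precision and frequency, per term."""
    terms = list(cat_counts)
    total = sum(cat_counts.values())
    # Precision: share of a term's occurrences that fall in the category.
    precision = [cat_counts[t] / (cat_counts[t] + other_counts[t]) for t in terms]
    # Frequency: share of the category's term occurrences taken by this term.
    frequency = [cat_counts[t] / total for t in terms]

    def cdf_scale(values):
        # Map raw values to (0, 1) relative to the other terms.
        dist = NormalDist(mean(values), pstdev(values))
        return [dist.cdf(v) for v in values]

    p, f = cdf_scale(precision), cdf_scale(frequency)
    return {
        t: (1 + beta**2) * pi * fi / (beta**2 * pi + fi)
        for t, pi, fi in zip(terms, p, f)
    }


# Frequencies copied from five rows of the table above.
everest = {"mountain_conquer": 3836, "base_camp": 1547, "climb": 4722,
           "people": 1207, "world": 1013}
entertainment = {"mountain_conquer": 19, "base_camp": 208, "climb": 1383,
                 "people": 557, "world": 387}
scores = scaled_f_scores(everest, entertainment)
```

On this subset, "climb" is the most frequent term but also appears often in the Entertainment category, so "mountain_conquer", which is both frequent and almost exclusive to Everest, receives the highest score.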

mountain conquer sir edmund hillary quoteoftheday mountainsdontfightback conqueryourfear overcome.

The original quote: "It is not the mountain we conquer but ourselves."

monsoon beautiful namche day everest base camp trekking ebctrek everestbasecamptrek himalayanwarrior nepal khumbu moonsoon July
stillbeautiful acclitimization guidelife bikramkarki apexhimalayatrek

chhang dawa sherpa today army helicopter saijd search flight aerial reconnaissance hour maximum limit locate miss climber ali john snorri juan pablo mohrcorpse climber sherpas mount everest extreme weather prevent removal preservebreakingnew official climber fear miss avalanche sweeps mount Everest

leadership, leadership_courage, tenzing_norgay (Tenzing Norgay), die_new, supplemental_oxygen, bucketlist, challenge_charity, cost

This set of words describes what people are looking for when going on the trek and, to some extent, shows motivations such as a challenge, charity, or an item on their bucket list. Cost is also a concern, as these tours and treks become more expensive with popularity.

avalanche kill single deadly accident mount everestavalanche kill single deadly accident mount everest cnn cnnavalanche kill single deadly accident mount

ascent, training, nepal, china, trekking, sherpa, mountain_conquer, guide, reach_summit, internet, tent, airport

high speed internet mount everesthigh point world cell service internet capability high peak mount Everest
mile airport city mount everest base camp week trip chinafly tenze hillary airport lukla nepal dangerous world gateway everest base camp trekeverest summit expedition kick tomorrow march departure place henri coandă airport bucharest

From these clusters we see that, with the onset of social media, motivations and experiences are shared openly. The clusters connect the rapid rise in expeditions on peaks like Everest with climbers who take inspiration from notable climbers of the past. “It is not the mountain we conquer but ourselves.” is Sir Edmund Hillary’s most retweeted and posted quote in the tweet dataset. People also express a desire to climb for “charity”, to “achieve”, or as a “challenge”, while being well aware of “cost”, of practicalities such as “supplemental oxygen”, and of the associated risk (“fatality”, “single_deadly”).

Statement of Work (April 2023)

Simi Talkar

- Project scout and environment setup
- Exploratory Data Analysis
- Dash App (lead and creator)
- Docker container
- Scraping and API retrieval of social media data and analysis (Twitter/Reddit)
- Final write-up (lead)

Brian Seko

- Data cleaning and structure
- Route Memo Clustering
- Route Memo Topic Modeling
- Climbing Period Feature Analysis (not included here)
- Final Write-Up (lead)

Matthieu Lienart