Interest in expeditions, especially to the coveted summit of Everest, was slow to pick up over the decades. But something changed in the late 1990s. While no direct correlation is being drawn here, social media platforms also emerged around this period and truly took off in the 2000s.
Number of Expeditions per Year on Mount Everest. We see a sharp increase in the number of expeditions per year after the start of commercial expeditions in 1988.
Reddit was founded on June 23, 2005, and close on its heels, Twitter users began posting in 2006. Reddit has two popular trekking-related subreddits, Mountaineering and Alpinism, besides the Everest subreddit. From Reddit posts we will derive topics (intent discovery) and topic trends.
On Twitter we find veteran mountaineers who have summited Himalayan peaks multiple times, such as Alan Arnette (@alan_arnette) and Kenton Cool (@KentonCool), who coach and motivate aspirants. In tweets, we look for the motivating factors that drive people to climb Everest and the perceptions people have of the trek up the Himalayan mountains.
To understand how social media connects with climbing, 117,093 tweets from 2006 to 2023 were extracted using Snscrape and the Twitter API. The extracted data shows a mix of topics. Those relevant to our interest contain tones of pride.
However, metaphors and idioms create noise. In common language, “Everest” is used as a synonym for the “unattainable” or the “very massive”, and it also appears in entertainment contexts such as the National Geographic series on Everest and the Disney ride "Expedition Everest". Our first challenge is to separate tweets about the Himalayan peak from the rest, so that we can gauge Twitter users' interest in the mountain itself.
We first must reduce this data set and remove tweets with obvious references to the entertainment domain.
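A first-pass filter of this kind could look like the sketch below. The keyword patterns and the function name `drop_entertainment` are illustrative assumptions, not the list actually used for this dataset; a real filter would be tuned by inspecting the tweets.

```python
import re

# Hypothetical patterns for obvious entertainment references
# (assumed for illustration, not the list used in this analysis).
ENTERTAINMENT_PATTERNS = [
    r"expedition\s*everest",   # the Disney ride, also written as a hashtag
    r"\bdisney\b",
    r"national\s*geographic",
    r"\bnat\s*geo\b",
]
_ENTERTAINMENT_RE = re.compile("|".join(ENTERTAINMENT_PATTERNS), re.IGNORECASE)


def drop_entertainment(tweets):
    """Return only the tweets that match none of the entertainment patterns."""
    return [t for t in tweets if not _ENTERTAINMENT_RE.search(t)]


sample = [
    "Finally rode Expedition Everest at Disney today!",
    "Reached Everest base camp after ten days of trekking",
    "Binge watching the National Geographic series on Everest",
]
kept = drop_entertainment(sample)
```

A keyword pass like this only catches the obvious cases. Metaphorical uses of "Everest" carry no fixed keyword, which is why topic modeling is still needed afterwards.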
The Gibbs Sampling Dirichlet Mixture Model (GSDMM) is an “altered” LDA algorithm that has shown strong results on short-text topic modeling tasks. It starts from the assumption that one topic describes one document: the words within a document are all generated from the same single topic, not from a mixture of topics. By contrast, Latent Dirichlet Allocation (LDA) uses a bag-of-words generative approach that draws from a probability distribution of topics in each document and a distribution of words in each topic. LDA is suited to large documents, from which frequently appearing words and their context can be inferred.
For short text, two sentences with similar words can have very different meanings, and the context depends on understanding the proximity of words. LDA's approach works less well here because most words occur only once in each short text; as a result, frequency-based measures such as Term Frequency-Inverse Document Frequency (TF-IDF) carry little signal. For instance, in a short sentence with no repeated words every term frequency is 1, which does nothing to distinguish one word from another.
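The point about term frequencies is easy to check on a tweet-length text. The snippet below is only an illustration; the variable names are ours.

```python
from collections import Counter

# A cleaned, tweet-length text (the Hillary quote discussed later in this article).
tweet = "it is not the mountain we conquer but ourselves"
term_counts = Counter(tweet.split())

# Every term occurs exactly once, so raw term frequency cannot rank the words.
max_count = max(term_counts.values())
distinct_counts = set(term_counts.values())
```

Here `distinct_counts` contains the single value 1: no word stands out by frequency within the tweet.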
A simple way to explain GSDMM is to imagine a classroom with students seated randomly at K tables. Each student is asked to write a short list of their favorite movies on a piece of paper. The objective is to cluster the students so that those within the same group share the same movie interests. To do so, as the professor calls each student’s name, the student must choose a new table according to the following two rules: pick a table with more students, and pick a table whose students share similar movie interests.
The tweet data is cleaned with Gensim and lemmatized with SpaCy. We can then fit a GSDMM model to the data. We consider the following parameters: the number of clusters K, alpha, and beta.
Alpha and Beta work in opposite directions in cluster convergence and were experimentally varied around the default value of 0.1. Raising Beta to 0.5 (from the default of 0.1) revealed more human-understandable topics for this dataset. The number of clusters (five topics) was selected based on the convergence of GSDMM and on the semantic similarity of the words in each cluster across the various values of K. The generated model was then applied to the entire dataset. Because the downstream task is to visualize common Everest-expedition words against everything else, we sharpened the distinction further by using hashtags that reveal the topic and by merging the non-Everest topics together. This results in two categories: Everest and non-Everest (which we will call Entertainment).
The hyperparameters chosen are therefore K=5, alpha=0.1, beta=0.5. The tokens in the cleaned, lemmatized text are then fit to a ScatterText two-category visualization model. ScatterText uses the scaled F-score, which takes into account category-specific precision and term frequency (Kessler, 2016/2023). While a term may appear frequently in both categories (Everest and Entertainment), the scaled F-score determines whether the term is more "characteristic" of one category than the other. This results in the following visualization.
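To make the roles of K, alpha and beta concrete, here is a minimal, standard-library sketch of a GSDMM-style collapsed Gibbs sampler. It is not the implementation used for this analysis: the function name `gsdmm`, the `n_iters` and `seed` arguments, and the toy documents are assumptions for illustration. Alpha enters the term that favors tables with more documents, and beta smooths the term that rewards word overlap with a cluster.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=5, alpha=0.1, beta=0.5, n_iters=15, seed=0):
    """Assign each tokenized document to one of at most K clusters."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})   # vocabulary size
    m = [0] * K                             # documents per cluster
    n = [0] * K                             # total words per cluster
    nw = [Counter() for _ in range(K)]      # per-cluster word counts

    def add(i, k):
        m[k] += 1
        n[k] += len(docs[i])
        nw[k].update(docs[i])

    def remove(i, k):
        m[k] -= 1
        n[k] -= len(docs[i])
        nw[k].subtract(docs[i])

    # Random initial seating.
    z = [rng.randrange(K) for _ in docs]
    for i, k in enumerate(z):
        add(i, k)

    for _ in range(n_iters):
        for i, d in enumerate(docs):
            remove(i, z[i])                 # the "student" stands up
            counts = Counter(d)
            log_p = []
            for c in range(K):
                # Rule 1: prefer clusters that already hold more documents.
                lp = math.log(m[c] + alpha)
                # Rule 2: prefer clusters whose words overlap with this document.
                for w, cnt in counts.items():
                    for j in range(cnt):
                        lp += math.log(nw[c][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[c] + V * beta + j)
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z[i] = rng.choices(range(K), weights=weights)[0]
            add(i, z[i])                    # sit down at the sampled table
    return z


# Toy, already-lemmatized documents (hypothetical).
docs = [
    ["climb", "summit", "sherpa"],
    ["summit", "base_camp", "climb"],
    ["disney", "ride", "queue"],
    ["ride", "disney", "park"],
]
labels = gsdmm(docs, K=5, alpha=0.1, beta=0.5)
```

`labels` holds one cluster index per document. K acts as an upper bound, since some clusters may end up empty after sampling.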
term | Everest frequency | Entertainment frequency | Everest F-score | Entertainment F-score |
---|---|---|---|---|
mountain_conquer | 3836 | 19 | 1.00 | 0.00 |
sir_edmund | 2163 | 99 | 1.00 | 0.00 |
edmund_hillary | 4107 | 82 | 1.00 | 0.00 |
hillary | 2184 | 104 | 0.99 | 0.01 |
base_camp | 1547 | 208 | 0.98 | 0.02 |
climb | 4722 | 1383 | 0.96 | 0.04 |
high | 1116 | 187 | 0.94 | 0.06 |
quote_day | 948 | 13 | 0.94 | 0.06 |
people | 1207 | 557 | 0.91 | 0.09 |
world | 1013 | 387 | 0.90 | 0.10 |
Top-ten most frequent Everest words and their F-scores
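For readers who want to see how a score of this kind is formed, here is a minimal sketch of a scaled F-score for one category. To our understanding of the ScatterText documentation, precision and within-category frequency are each normalized through a normal CDF and then combined with a harmonic mean; this sketch follows that idea in simplified form and is not ScatterText's own code. The function names and the counts for `ride` are assumptions. Because the normalization depends on the whole vocabulary, the sketch will not reproduce the values in the table above.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def cdf_normalize(values):
    """Map raw values to (0, 1) via the normal CDF of their z-scores."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.5] * len(values)
    return [norm_cdf((v - mu) / sd) for v in values]


def scaled_f_scores(cat_counts, other_counts, beta=1.0):
    """Simplified scaled F-score of each term for the category `cat_counts`.

    Assumes every term has a positive count in at least one category.
    """
    terms = sorted(set(cat_counts) | set(other_counts))
    total_cat = sum(cat_counts.values())
    # Precision: share of a term's occurrences that fall in this category.
    prec = [cat_counts.get(t, 0) / (cat_counts.get(t, 0) + other_counts.get(t, 0))
            for t in terms]
    # Frequency: share of this category's tokens taken up by the term.
    freq = [cat_counts.get(t, 0) / total_cat for t in terms]
    p_norm, f_norm = cdf_normalize(prec), cdf_normalize(freq)
    scores = {}
    for t, p, f in zip(terms, p_norm, f_norm):
        denom = beta ** 2 * p + f
        scores[t] = (1 + beta ** 2) * p * f / denom if denom > 0 else 0.0
    return scores


# Counts for base_camp and climb come from the table; "ride" is hypothetical.
everest = {"base_camp": 1547, "climb": 4722, "ride": 5}
entertainment = {"base_camp": 208, "climb": 1383, "ride": 900}
scores = scaled_f_scores(everest, entertainment)
```

In this toy vocabulary, `ride` has both the lowest precision and the lowest frequency in the Everest category, so it receives the lowest Everest score of the three terms.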
mountain conquer sir edmund hillary quoteoftheday mountainsdontfightback conqueryourfear overcome.
The original quote: "It is not the mountain we conquer but ourselves."
monsoon beautiful namche day everest base camp trekking ebctrek everestbasecamptrek himalayanwarrior nepal khumbu moonsoon July
stillbeautiful acclitimization guidelife bikramkarki apexhimalayatrek
chhang dawa sherpa today army helicopter saijd search flight aerial reconnaissance hour maximum limit locate miss climber ali john snorri juan pablo mohrcorpse climber sherpas mount everest extreme weather prevent removal preservebreakingnew official climber fear miss avalanche sweeps mount Everest
leadership, leadership_courage, tenzing_norgay (Tenzing Norgay), die_new, supplemental_oxygen, bucketlist, challenge_charity, cost
Sir Edmund Hillary
This set of words describes what people look for when going on the trek and, to some extent, shows motivations such as challenge, charity, or a long-standing place on their bucket list. Cost is also a concern, as these tours and treks become more expensive with popularity.
avalanche kill single deadly accident mount everestavalanche kill single deadly accident mount everest cnn cnnavalanche kill single deadly accident mount
ascent, training, nepal, china, trekking, sherpa, mountain_conquer, guide, reach_summit, internet, tent, airport
high speed internet mount everesthigh point world cell service internet capability high peak mount Everest
mile airport city mount everest base camp week trip chinafly tenze hillary airport lukla nepal dangerous world gateway everest base camp trekeverest summit expedition kick tomorrow march departure place henri coandă airport bucharest
From these clusters we see that, with the onset of social media, motivations and experiences are shared openly. The clusters connect with the rapid rise in expeditions on peaks like Everest by people who take inspiration from notable climbers of the past. “It is not the mountain we conquer but ourselves” is Sir Edmund Hillary’s most retweeted and posted quote in the tweet dataset. There is also an expressed desire to climb for “charity”, to “achieve”, or as a “challenge”, alongside a clear awareness of “cost”, of practicalities such as “supplemental oxygen”, and of the associated risk (“fatality”, “single_deadly”).