Topic Trends, Route Memos, and What They Tell Us About Climbers

We also seek to understand what route memos reveal about the motivations of climbers. These memos are curated in the Himalayan Database from expeditions as early as 1905. Each memo describes the route a climbing team took, and also notes achievements, injuries, deaths, failed attempts and conflicts occurring on the mountains. Many important data elements found in these memos are represented within the database, but the memos also contain rich stories that may provide insight into similarities between expeditions that could not be obtained from the raw data.

The Rise of Topic Trends and the Exposure to Danger

Using data from Reddit, we unearth topics of interest in the posts. In Section 1 we find the range of topics that are distinctly separable when unsupervised learning is applied to the full dataset. In Section 2 we split the dataset into train and test sets to evaluate the performance of topic modeling on unseen data. We also analyze topic trends in an attempt to understand the popularity of topics through the years. We observe connections between the discussion of technical/difficult routes and more novice climbers resolved to accomplish a bucket-list item.

Intent Discovery Using BERTopic with Individually Tuned Components

SECTION 1 Deriving Topics from Full Dataset with Unsupervised Learning

The original dataset consisted of 1032 route-related post discussions filtered from 10504 posts collected from the subreddits Everest, Mountaineering, and Alpinism. As illustrated below, Everest route-related posts are distinct from those of Mountaineering and Alpinism, which are more tightly clustered together. These latter subreddits also form a tight cluster, off to the right in the figure, that specifically discusses boots and equipment. We therefore limited the unsupervised learning analysis to the route-related posts of the Everest subreddit (407).

Unsupervised learning using BERTopic provides a simple and efficient way to perform topic modeling on text data. It uses the state-of-the-art transformer-based language model BERT to generate embeddings for each document in the dataset, and then applies a clustering algorithm to group similar documents into topics.

The benefit of using BERTopic is that it allows one to quickly and easily identify the most important themes or topics within a large corpus of text data. Compared to traditional topic modeling techniques, such as Latent Dirichlet Allocation (LDA), BERTopic has several advantages (Grootendorst, n.d.):

Tune BERTopic Pipeline Components

We start by breaking BERTopic down into its components, i.e. the UMAP dimensionality reduction and the HDBSCAN clustering, in order to tune the parameters of each component individually. Although time-consuming and cumbersome, this manual tuning and visual analysis of the resulting clusters established how distinctly separated or overlapping our clusters are. Clustering depends strongly on the context, aims and decisions of the researcher, as argued in the paper “What are true clusters?” (Hennig, 2015). This breakdown analysis and visualization shows that the min_dist parameter of UMAP plays the most significant role in this dataset (the full dataset as well as the subsequent train split). The parameter controls how tightly UMAP is allowed to pack points together by specifying the minimum distance between points in the low-dimensional representation. Lower values of min_dist (0.07) resulted in tightly bound topic clusters. A more nuanced explanation of the UMAP and HDBSCAN parameters can be found in the Appendix section of this document.

Automating Tuning of Parameters

Visual inspection of the clustering provides an approximate range of parameters, and we now refine the search for the best parameters within this parameter space. The documents are converted into sentence embeddings using three of the pre-trained sentence transformers supported by BERTopic. Just as words can be encoded into word vectors, a sentence embedding is a numerical representation of a sentence and, in its simplest form, can be thought of as the average of all resulting word vectors. These embeddings capture semantic similarity between documents. As noted by (Reimers, 2019), “Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.”

Sentence Transformer	Length of Embedding
all-mpnet-base-v2 (BERTopic default)	768
all-MiniLM-L6-v2	384
paraphrase-mpnet-base-v2	768

Length of sentence embeddings based on sentence transformers

The dimensionality of the sentence embeddings above is not conducive to effective clustering, leading us to employ UMAP dimensionality reduction as the next step in the topic modeling pipeline. UMAP preserves much of the local and global structure of the high-dimensional space in fewer dimensions.

This breakdown into components provided us with the ‘prior’ knowledge that was then used to set the ranges of the optimization space for automated hyperparameter tuning with the HyperOpt Bayesian parameter search. In the evaluation results below, the lowest “loss”, which reflects data points not being assigned a topic, was 0.007371, obtained with a UMAP min_dist of 0.08.

Index	Model	Loss	Label Count	N-neighbors	N-components	Min Cluster Size	Min Distance	Random State
0	all-mpnet-base-v2	0.058968	7	5	6	10	0.05	41
1	all-MiniLM-L6-v2	0.027027	7	3	5	10	0.08	41
2	paraphrase-mpnet-base-v2	0.007371	6	3	9	10	0.08	41

Result of Bayesian search for parameters using HyperOpt

Evaluation of the Pipeline and Loss Explanation

Automated cluster evaluation is performed with the intention of reducing the amount of un-clustered data. Ideas about automating scoring and evaluating clusters non-visually have been taken from this article. The visual dimension reduction and clustering from above gives us a sense of the range of parameters to try out. We perform the parameter tuning with HyperOpt, which implements Bayesian hyperparameter search.
Adopting the article's cost function, the cost is defined as the fraction of points whose probabilities_ value is below a set threshold, where probabilities_ is set by the HDBSCAN clustering and gives the strength of confidence that a point belongs to its cluster. We couple this with the range of the number of clusters we are “expecting”, based on our knowledge derived from the earlier visualizations of the clusters. For the dataset with all Everest subreddit routes, the objective is defined thus:

Minimize: Cost = fraction of the dataset with a cluster membership probability below the 5% threshold
While keeping the number of clusters in the range: 2 < num_clusters < 10

Placing every point in a cluster does not by itself mean the clustering is optimal, since each point can be “its own cluster” or one large cluster can contain all the points. Therefore, a 15% penalty is additionally placed on the cost function if the number of clusters falls outside the desired range. This is where we benefit from having visually inspected the output of each module of the BERTopic pipeline and conducted a semantic analysis of the clusters: it gives us a baseline for the expected range of clusters.
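The objective described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the analysis: the function name clustering_cost, its argument names, and the choice to add the 15% penalty to the cost (rather than, say, multiply by it) are assumptions. The labels and probabilities arguments stand for the per-document cluster labels and membership strengths that HDBSCAN exposes as labels_ and probabilities_.

```python
def clustering_cost(labels, probabilities, threshold=0.05,
                    min_clusters=2, max_clusters=10, penalty=0.15):
    """Return (number of clusters, cost) for one clustering result.

    labels        -- cluster label per document, -1 meaning "no cluster"
    probabilities -- membership strength per document, between 0 and 1
    """
    # Count real clusters; -1 is the outlier label, not a cluster.
    n_clusters = len({label for label in labels if label != -1})

    # Cost: fraction of documents whose membership strength is below the threshold.
    low_confidence = sum(1 for p in probabilities if p < threshold)
    cost = low_confidence / len(probabilities)

    # Assumption: the penalty is added when the cluster count is out of range.
    if not (min_clusters < n_clusters < max_clusters):
        cost += penalty
    return n_clusters, cost


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 2, 2, -1, -1]
    probabilities = [0.9, 0.8, 0.7, 0.6, 0.9, 0.95, 0.0, 0.01]
    print(clustering_cost(labels, probabilities))
```

In a HyperOpt search, a value like this cost would be what the objective function returns for each candidate set of UMAP and HDBSCAN parameters.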

We use the parameters found for UMAP and HDBSCAN in the tuned BERTopic model, along with a CountVectorizer that removes stop words. This results in a better model, both in terms of fewer documents left unassigned to a topic and in terms of topics that have significant, interpretable words. Without the CountVectorizer, topics contained stop words; this was amended in the next iteration, where we split the data into train and test sets and cleaned the text before computing sentence embeddings.

With the optimal parameters discovered through automated hyperparameter tuning, only 32 points are left unclustered in the tuned model (assigned topic = -1), which means the number of unclustered points is reduced by about 75% (down from 132). The optimized model also shows that 68.8% of the clustered points, that is documents, relate to Everest base camp climbs. The optimized model lost certain topics that the out-of-the-box BERTopic model surfaced, such as “books and movies”, but it did connect the conversations to the usage of oxygen. It uses the following parameters, specified as the best parameters by the HyperOpt Bayesian parameter search.

n_neighbors= 3
n_components=9
min_cluster_size = 10
min_dist = 0.08
max_n_grams=3
random_state = 42
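As a hedged sketch, the parameters listed above could be passed to BERTopic roughly as follows. The function name build_topic_model is hypothetical, the mapping of max_n_grams to CountVectorizer's ngram_range=(1, 3) and the cosine metric for UMAP are assumptions, and the embedding model is taken from the row of the HyperOpt results table that matches these parameters. The heavy libraries are imported inside the function so that the parameter dictionary can be inspected without them installed.

```python
# Best parameters reported above for the full Everest dataset.
TUNED_PARAMS = {
    "n_neighbors": 3,
    "n_components": 9,
    "min_cluster_size": 10,
    "min_dist": 0.08,
    "max_n_grams": 3,
    "random_state": 42,
}


def build_topic_model(params=TUNED_PARAMS,
                      embedding_model="paraphrase-mpnet-base-v2"):
    """Assemble a BERTopic model from individually configured components."""
    # Local imports: bertopic, umap-learn, hdbscan and scikit-learn are required.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=params["n_neighbors"],
        n_components=params["n_components"],
        min_dist=params["min_dist"],
        metric="cosine",  # assumption: not stated in the text
        random_state=params["random_state"],
    )
    hdbscan_model = HDBSCAN(
        min_cluster_size=params["min_cluster_size"],
        prediction_data=True,
    )
    # Stop-word removal happens at the topic-representation stage.
    vectorizer_model = CountVectorizer(
        stop_words="english",
        ngram_range=(1, params["max_n_grams"]),  # assumption about max_n_grams
    )
    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
    )
```

Calling build_topic_model().fit_transform(documents) on a list of post texts would then return the topic assignment and probability for each document.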

Semantic Evaluation of Topic Labels and Documents

(Full dataset tuned model)

Topic Trends in the Full Dataset

Topic trends show that most of the discussion concerns climbing Everest only as far as the base camp, and that interest lies primarily in short trips on Everest and adjacent peaks. When focusing on the base camp treks, which form 68.8% of the discussions, in each of the spikes we find words related to the logistics of the hike such as “lukla” (the airport closest to the site from where base camp treks commence), books of interest and hobbies. In the first half of 2021, we see words like infections and Covid trending, along with permits. Topic 4, which shows a spike in July 2019, predominantly revolved around trash. The same topic also covered oxygen usage; the two discussions (oxygen and trash) were probably clustered together because oxygen canisters form a large part of the debris and garbage left behind.

Intent Discovery

Although BERTopic provides us with labels, using c-TF-IDF and CountVectorizer, intent discovery was additionally performed by selecting the noun chunks from each clustered topic, splitting this chunk into individual nouns and sorting by highest frequency. The nouns were then joined together to form an informative label. These labels were not found to be more informative than BERTopic labels and so the topics created by BERTopic are relied upon in further analysis.
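The noun-based labeling step described above can be sketched as follows. This is an illustration under stated assumptions: the function name label_from_noun_chunks is hypothetical, and the input is assumed to be noun chunks that were already extracted for one topic (for example with spaCy) and already reduced to their nouns.

```python
from collections import Counter


def label_from_noun_chunks(noun_chunks, top_n=10):
    """Build a topic label from the most frequent nouns in a topic's noun chunks."""
    nouns = []
    for chunk in noun_chunks:
        # Split each chunk into its individual words.
        nouns.extend(chunk.lower().split())
    # Sort by frequency (ties keep first-seen order) and keep the top_n words.
    most_common = Counter(nouns).most_common(top_n)
    return "_".join(word for word, _count in most_common)


if __name__ == "__main__":
    chunks = ["base camp", "camp trek", "trek guide", "camp"]
    print(label_from_noun_chunks(chunks, top_n=3))
```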

Trek_guide_porter_ebc_day_camp_base_good_time_nepal
Ridge_climb_step_hillary_south_camp_summit_northeast_climber_north
Climb_sherpas_good_sherpa_climber_ladder_summit_big_camp_story
Climb_peak_summit_trail_body_time_difficult_climber_d_point
body_summit_report_attempt_team_remove_mallory_ray_snow_day

SECTION 2 Evaluation of Unsupervised Learning with Manual Labeling

To evaluate how well the unsupervised learning task performs on unseen data, we split the route-related dataset of the Everest subreddit into train and test sets. The train set contains 382 documents and the test set contains 25 documents. Another difference in this analysis is the use of cleaned text, as opposed to the raw Reddit data used in Section 1.

Cleaning List
Extract word tokens, remove excess spaces, remove links
Add to stop words 'route', 'think', 'thank', 'people', 'everest', 'mountain', 'm', 'I', 'd'

Cleaning and SpaCy lemmatizing applied on posts on train set documents

As mentioned above, using just a sentence transformer followed by dimensionality reduction and clustering, with no CountVectorizer to remove stop words, resulted in topics dominated by stop words such as ‘to’, ‘and’, ‘of’, despite the use of the reduce_frequent_words parameter of the ClassTfidfTransformer. In this section, we therefore begin by cleaning the text: removing unnecessary characters and formatting, removing stop words and lemmatizing the words with SpaCy. The CountVectorizer was created with 2-grams since there are common two-word usages like “Mount Everest”, “Hiking Boots”, “Himalayan Expedition”.
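A minimal sketch of the cleaning steps from the list above, using only the standard library. The function names and regular expressions are assumptions, lemmatization with SpaCy is left out, and only the custom stop words from the cleaning list are removed here (lowercased, so 'I' becomes 'i'); a general English stop word list would be applied in addition.

```python
import re

# Custom stop words from the cleaning list, lowercased.
CUSTOM_STOP_WORDS = {"route", "think", "thank", "people", "everest",
                     "mountain", "m", "i", "d"}

LINK_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def clean_post(text, stop_words=CUSTOM_STOP_WORDS):
    """Remove links, extract lowercase word tokens, and drop stop words."""
    text = LINK_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())  # also discards excess spaces
    return [token for token in tokens if token not in stop_words]


def bigrams(tokens):
    """Return adjacent word pairs, the 2-grams a vectorizer would count."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


if __name__ == "__main__":
    post = "I think  the Everest route via https://example.com/x needs hiking boots"
    tokens = clean_post(post)
    print(tokens)
    print(bigrams(tokens))
```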

Topic	Count	Name
-1	21	-1_death_death zone_zone_trip
0	126	0_climb_summit_climber_body
1	11	1_furtenbach_covid_sherpa_camp
2	224	2_trek_day_guide_porter

Base model topic assignments of train set documents

When using only a reduced dataset for training, the number of topics created drops drastically, as seen above. We have lost the topics revolving around oxygen usage and books, for instance. But the reduced model also surfaces the real concerns of the Everest subreddit's posters. The baseline has three clusters into which the documents were placed. 94.5% of the documents were placed into a topic, setting a very high baseline. Topic -1 above is the only set of documents not assigned to a definitive cluster.

BERTopic with Tuned Components in the Train Set

The ranges for the parameters for HyperOpt Bayesian Parameter Search were taken from the unsupervised learning applied on the full dataset task, where visualizing the cluster separation and semantic understanding of topics in a cluster provided us with a reasonable range. The best sentence embedding is the default sentence embedding used by BERTopic. Unlike the analysis in the above section, we are using this sentence embedding on cleaned lemmatized tokens as it suggested more illuminative topic labels.

When creating the tuned model, though, we saw that changing the min_dist parameter of UMAP gave a better separation of topics and more nuanced and distinct topics. This has been the case throughout the analysis (from the unsupervised Section 1 to the partially supervised Section 2). This parameter, when set to a low value, aids in “emphasizing the similarity of dense clusters” of samples (Armstrong, n.d.). The hierarchical linkage of the topics created, seen below, shows that topics 1, 2 and 3, regarding short treks, base camp treks and first-time trekker planning logistics, are linked at the first levels of the hierarchy by the agglomerative clustering algorithm HDBSCAN.

These three together account for nearly 60% of the posts. The second largest clustering is of climber routes, bodies and remains in Topic 0 which is about technical discussions about the routes. The last clustering falls into Topic 4 which is distinctly about Covid pandemic and the impact of Covid on Sherpas and their livelihood.

Semantic Evaluation of Topic Labels and Documents

(Train set tuned model)

Topic Trends in Train Dataset

Although topics are lost when the dataset is reduced in order to split and evaluate it, the primary topics of interest continue to be ascents to the base camp and short trips. The three topics that deal with shorter guided treks (topics 1, 2 and 3) also spike in frequency, with words such as “care, ease, respectful” showing significant contributions. The impact of Covid on the livelihood of Sherpas is strikingly noticeable in the 2021 postings. As in the analysis of the full dataset, death (bodies and remains) and route technicalities in topic 0 form a smaller but frequent topic of discussion, indicating, not surprisingly perhaps, how critical the choice of route is to the success of an expedition. This discussion is also distinct from the larger set of topics about base camp treks, whose authors can be presumed to be less experienced climbers.

Evaluation of Model

The test set, without any predicted labels attached, was labeled manually by each of the team members. The homogeneity of their ratings, or lack thereof, was evaluated using Krippendorff’s Alpha, a statistical measure of the agreement achieved when applying a set of labels. A value of 1 indicates perfect agreement, and a value of 0 indicates agreement no better than chance. Scores above 0.8 are considered “good reliability”; scores less than 0.667 are considered inadequate and indicate no inter-rater reliability (Krippendorff, 2004).
The value for the small test set, 0.67, was barely above the minimum acceptable threshold.
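For readers unfamiliar with the statistic, Krippendorff's alpha for nominal labels can be computed from a coincidence matrix as sketched below. This is an illustrative implementation, not the code used for the reported value; the function name and the input layout (one list of rater labels per document, with None for a missing rating) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units -- one list per document holding each rater's label (None = missing).
    """
    # Coincidence matrix: every ordered pair of labels from different raters
    # within a document, weighted by 1 / (number of ratings - 1).
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        if len(values) < 2:
            continue  # a document with a single rating carries no agreement info
        weight = 1.0 / (len(values) - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += weight

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


if __name__ == "__main__":
    ratings = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
    print(krippendorff_alpha_nominal(ratings))
```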

From among the 12 documents that had the same label across all the raters, the model's prediction matched 7, giving an accuracy of 58%. This value is too low to certify the model as a valuable tool for topic modeling of Reddit posts.
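The accuracy figure above is computed only over documents on which all raters agree. A small sketch of that calculation follows; the function name and the toy labels are hypothetical and are not the study's data.

```python
def accuracy_on_unanimous(rater_labels, predictions):
    """Compare model predictions with documents that all raters labeled identically.

    Returns (matches, unanimous_documents, accuracy).
    """
    matches = 0
    total = 0
    for labels, predicted in zip(rater_labels, predictions):
        if len(set(labels)) == 1:  # every rater chose the same label
            total += 1
            matches += int(labels[0] == predicted)
    accuracy = matches / total if total else float("nan")
    return matches, total, accuracy


if __name__ == "__main__":
    rater_labels = [["trek", "trek", "trek"],
                    ["trek", "route", "trek"],
                    ["route", "route", "route"]]
    predictions = ["trek", "trek", "trek"]
    print(accuracy_on_unanimous(rater_labels, predictions))
```

With the counts reported above, 7 matches out of 12 unanimous documents gives 7 / 12, or roughly 58%.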

Those who attempt the climb are clearly aware of their own mortality and the risks. The bodies of past climbers that remain on the routes, like David Sharp and “green boots”, are discussed, and news of climbers meeting their demise is tweeted. Despite this awareness, the leadership and courage shown by Edmund Hillary and Sherpa Tenzing Norgay continue to resonate with expeditioners, 70 years after they summited the mountain. The interest shown in social media is focused not so much on “conquering the mountain” as on climbing to the base camp(s), which is a feat in itself as they are at over 5000 meters. The posts regarding the impact of Covid on the economy of Nepal and the livelihood of Sherpas indicate how dependent their lives have become on the rise of expeditions, as they form a large support system for expeditions to the base camp and beyond.

Like a Rock: The Steady, Unchanged Climbers

We hypothesize that route memos from different climbing periods have differentiated content. Do topics, thoughts, and stories change over time, similar to how Himalayan expeditions have moved through new eras? In (Savage & Torgler, 2013) it is noted that “To date there has been very little evidence demonstrating shifts in social norms, emotions or group identity over time in extreme, or life and death situations.” We will explore this sentiment by looking at the content of the route memos. From (Collins-Thompson, n.d.) we have learned that clustering can be used to identify similar groups within a corpus of documents. We will follow this approach to explore our question.

Evaluating Clustering as a Concept

First, we will evaluate if it is possible to create clusters from our text data. That is, to explore what a clustering task could produce based on clustering quality principles (Collins-Thompson, n.d.) :

Our task will focus on concepts 1 and 2 at this stage, specifically establishing the appropriate number of clusters. We will use two indexes to measure the ratio of within-group variance to between-group variance. These indexes are the Davies-Bouldin index (DBI) and the Calinski-Harabasz Index (CHI). The scores will be cross referenced to find the “optimal” number of clusters (Collins-Thompson, n.d.).

DBI has an intuitive interpretation: it measures the average similarity between each cluster and its most similar other cluster. When the clusters are very different, the score becomes smaller. This metric is also rather robust against noise, so it is ideal in the early stages of assessing the data, before in-depth cleaning. To check the validity of the scores, another metric should be used to cross-reference the results. There are many options, but the CHI is another intuitive starting point. Like DBI, it is an easy-to-understand metric; here we look at the ratio of the between-cluster variance to the within-cluster variance. A higher score means the clusters are well formed and separate. This metric is also somewhat robust against noise, so again, we don’t need to be overly concerned with complex cleaning to assess whether this is working. When we review the scores, a lower DBI is better and a higher CHI is better.
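For reference, the standard definitions of the two indexes described above are given below. The notation is ours rather than the source's: k is the number of clusters, n the number of documents, s_i the average distance of the points in cluster i to its centroid, d_ij the distance between the centroids of clusters i and j, and B_k and W_k the between-cluster and within-cluster dispersion matrices.

```latex
\mathrm{DBI} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{s_i + s_j}{d_{ij}},
\qquad
\mathrm{CHI} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)}
```

The DBI shrinks as clusters become compact (small s_i) and far apart (large d_ij), while the CHI grows as between-cluster dispersion dominates within-cluster dispersion.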

We propose a novel way to compare the two scores. We can normalize the values, so the highest score is always 1 and the lowest score is always 0. Then we can weight the DBI and combine the scores, so the highest value represents the best combinations of scores. This is imperfect, so it will still require manual inspection, but it allows us to greatly narrow down the K selection.
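The score combination described above can be sketched as follows. The exact formula is not given in the text, so this sketch makes two assumptions: the normalized DBI is inverted (because lower DBI is better) before being weighted, and the two normalized scores are added. The function names and the example numbers are hypothetical.

```python
def min_max(values):
    """Rescale values so the lowest becomes 0 and the highest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(dbi_scores, chi_scores, dbi_weight=1.0):
    """Combine DBI (lower is better) and CHI (higher is better) per candidate K."""
    # Invert the normalized DBI so that, like CHI, higher means better.
    dbi_norm = [1.0 - v for v in min_max(dbi_scores)]
    chi_norm = min_max(chi_scores)
    return [dbi_weight * d + c for d, c in zip(dbi_norm, chi_norm)]


if __name__ == "__main__":
    ks = [2, 3, 4, 5]
    dbi = [2.0, 1.5, 1.0, 1.2]        # illustrative values only
    chi = [100.0, 80.0, 90.0, 95.0]   # illustrative values only
    scores = combined_scores(dbi, chi)
    best_k = ks[scores.index(max(scores))]
    print(scores, best_k)
```

The K with the highest combined score is only a candidate; as noted above, the result still needs manual inspection.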

From our results we can see an optimal number of clusters around a K of 5. DBI suggests that the ideal K is somewhere between 5 and 9, while the CHI shows a sharp drop off initially and then slowly tails off. When we combine these scores, we converge around 5, which is suitable for us to evaluate the task at this stage.

Pre-processing

The type of pre-processing applied to the data will impact the results (Kutuzov & Kuzmenko, 2019). The choices of whether to lemmatize words (reducing them to their roots) and drop stop words, and of how to deal with abbreviations, dates, and names, all have an impact. Words are lowercased since we are not concerned with sentence structure. Extra white spaces are removed, along with excessive new lines and special characters. Numerical values were not removed, and when identified, feet were converted to meters.
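As an illustration of the unit conversion mentioned above, the following sketch rewrites measurements in feet as meters in already lowercased text. The regular expression, the rounding to whole meters, and the function name are assumptions; the memos may express heights in other ways that this pattern does not catch.

```python
import re

# A number (optionally with thousands separators or decimals) followed by a feet unit.
FEET_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:feet|foot|ft)\b")


def feet_to_meters(text):
    """Replace measurements in feet with the rounded equivalent in meters."""
    def _convert(match):
        feet = float(match.group(1).replace(",", ""))
        return f"{round(feet * 0.3048)} m"  # 1 foot = 0.3048 meters
    return FEET_RE.sub(_convert, text)


if __name__ == "__main__":
    print(feet_to_meters("camp at 20000 ft on the ridge, summit at 29,029 feet"))
```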

Abbreviations were also prevalent in the un-processed text. (Okazaki & Ananiadou, 2006) propose methods to identify abbreviations from text within the sentence. We can adopt this line of thinking and apply it to the corpus, since abbreviations are domain specific (e.g., BIV for bivouac). First, we created a list of all possible abbreviations, using regex to search for content that was not a stop word and that was only 3 letters long. While 4 letter abbreviations are possible, they were not common within the corpus, so they were not considered. Next, following (Schwartz & Hearst, 2002), a c-value score was calculated at the corpus level (rather than the sentence level), and the abbreviation definitions with the highest c-values were chosen. Finally, the frequency of these abbreviations was assessed. Most abbreviations did not benefit from being decoded (e.g., those occurring fewer than a few hundred times in the corpus). The domain-specific abbreviations that did occur with high frequency were added to the cleaning step. As noted, lemmatizing (reducing a word to its root) can be a helpful step (Kutuzov & Kuzmenko, 2019). We include it in our evaluation and create a lemmed vs non-lemmed corpus. Finally, we split the data into four distinct groups that represent the different climbing periods.
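The first step of the abbreviation search described above, collecting 3-letter tokens that are not stop words, might look like the sketch below. The stop word set is a tiny illustrative stand-in, the function name is hypothetical, and the later c-value scoring step is not shown.

```python
import re
from collections import Counter

# Illustrative stand-in; a full English stop word list would be used in practice.
STOP_WORDS = {"the", "and", "was", "for", "but", "not"}


def candidate_abbreviations(text, stop_words=STOP_WORDS):
    """Count 3-letter tokens that are not stop words as abbreviation candidates."""
    tokens = re.findall(r"\b[a-z]{3}\b", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


if __name__ == "__main__":
    memo = "Set BIV at 6000m and the biv was cold near ABC."
    print(candidate_abbreviations(memo).most_common())
```

Ordinary 3-letter words such as "set" survive this filter, which is why the candidates are scored and frequency-checked afterwards.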

Broader Hyper-Parameter Tuning with Grid Search

Using a grid search we can test many combinations of hyper-parameters. N-grams over 3 produced few clusters during testing because, as the size of the n-grams increases, each distinct n-gram occurs less often and fewer of them pass the term frequency thresholds (Speech and Language Processing, n.d.). As the size of the n-grams increases, the resulting clusters may also become more specific and less general. This can lead to fewer clusters overall, as the clusters become more distinct and specific to certain patterns within the data (Aggarwal & Reddy, 2016).

Our grid search evaluated over 900 possible combinations; from this it was observed that:

When we evaluate model performance, it is natural to compare the results of one ngram type, and period to another. We cannot do this with the indexes chosen. The results are specific to the model and it is invalid to make a side by side comparison. We can, though, look for trends that suggest some combinations are performing better than others. In the below images the results show scores on the y-axis for each hyperparameter/ngram/period set, which is defined as an index on the x-axis. Non-lemmed sets are colored gray, lemmed sets are colored blue.

We see models compared between the exploratory unigram (left), and transitional unigram result (right). As the index increases from left to right on the x-axis the number of clusters increases. We see from these results more clusters performed better than fewer clusters. It is also noticeable that there is no difference between the lemmed and non-lemmed versions of the input corpus.

Some results suggested that the models had outliers or were overtrained. We see this in the commercial unigram results, with Calinski-Harabasz scores in the millions. While scores of several hundred thousand can be seen, scores in the millions suggest issues with the input (Caliński & Harabasz, 1974). However, there is enough data from the other models to overcome these results and we can conclude that ngrams higher than 7 showed better results. The other hyper-parameters had little impact, and so our final model will use a randomized search method such as scikit-learn's RandomizedSearchCV to optimize the model. Overall, focusing more on the Davies-Bouldin scores, which showed fewer issues, bigrams and unigrams had lower scores. As a result, we focus on these ngram types for the final model.

The insights we have gathered, together with visual exploration of the tuned models, help determine our final configuration. First, we look at the clustering of the exploratory period, which is optimized at 9 clusters. This output does not meet our definition of quality. Few clusters are discernible, and the data is evenly spread out and amorphous instead of forming distinct clusters. One issue could be that creating groups per period is too restrictive. This could also explain why the Calinski-Harabasz scores became so high, which may be a byproduct of overtraining. Another approach is to split the data into two periods, early vs late. If we dichotomize the data into pre-1970 and post-1970 expeditions, there may be more data to form clusters. Unfortunately, the results are not any better using this approach. However, when we use the entire data set, bigrams and trigrams both produce promising clusters.

Trying different topic clustering.
Dividing the route memo data does not produce quality clusters based on our criteria. Using the entire corpus we see that bigrams and trigrams produce separated and distinct clusters

The results are clear. Breaking the data into periods does not produce better models. When we look at the data as a whole, we see groups emerge from the route memo clusters. Visually, trigrams have clusters that are more separated, which is intuitive when we consider the lower probability of trigrams passing term frequency thresholds. Listing the range of expedition years in each cluster, however, we cannot conclude that content is related to specific periods of time.

Label: 0 Years: 1905 - 2022
Label: 1 Years: 1952 - 2020
Label: 2 Years: 1929 - 2022
Label: 3 Years: 1949 - 2022
Label: 4 Years: 1966 - 2019
Label: 5 Years: 1986 - 2022
Label: 6 Years: 1988 - 2022
Label: 7 Years: 1985 - 2022
Label: 8 Years: 1983 - 2022
Label: 9 Years: 1950 - 2021

The year ranges of the expeditions under each label overlap considerably. Our hypothesis that memo content is related to specific periods does not hold. We reason this as follows:

Topic Modeling

In addition to forming poor clusters visually, the period groups did not produce topics that were sufficiently different between groups, or even for the entire period. This is consistent with the finding that clusters were not well formed when we separated the expeditions by period. Does this suggest that route memo content has not changed? To explore this further, we look at a few different methods to extract meaningful topics from our data.
First, we use a term score as a simple method to see whether the topics produced appear differentiated. We are looking for topics that are specific to a group rather than global: if a topic can be applied to many clusters, it does not represent a distinct group. By extracting the terms from the clusters, we can use the terms closest to the centroid as the “best” representation of that group. We then apply a term score that evaluates how specific each word is to a particular cluster (Collins-Thompson, n.d.). A simple way to think about a term score is to evaluate how surprised we are that a word exists within a specific cluster, compared against the corpus and other clusters. If a word is more likely to occur within a cluster than by chance, we would consider it a good representation of that cluster.
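One simple way to express the "surprise" idea above is the log ratio between a word's relative frequency in a cluster and its relative frequency in the whole corpus. The exact term score used in the analysis is not specified, so this formula, the function name, and the toy tokens are assumptions.

```python
import math
from collections import Counter


def term_scores(cluster_tokens, corpus_tokens):
    """Score each cluster term by log(P(term | cluster) / P(term | corpus)).

    Positive scores mean the term is more frequent in the cluster than in the
    corpus overall. Assumes the cluster's tokens are part of the corpus.
    """
    cluster_counts = Counter(cluster_tokens)
    corpus_counts = Counter(corpus_tokens)
    scores = {}
    for term, count in cluster_counts.items():
        p_cluster = count / len(cluster_tokens)
        p_corpus = corpus_counts[term] / len(corpus_tokens)
        scores[term] = math.log(p_cluster / p_corpus)
    return scores


if __name__ == "__main__":
    cluster = ["rope", "rope", "camp"]
    corpus = ["rope", "rope", "camp", "camp", "camp", "snow"]
    for term, score in sorted(term_scores(cluster, corpus).items()):
        print(term, round(score, 3))
```

Words that appear everywhere in the corpus score near or below zero for every cluster under this measure, so they would not be chosen as a cluster's representative terms.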
While this sounds promising, the results do not lend themselves well to interpretation. Evaluating the clusters of the exploratory period, we get the following topics:

The results show many repeated words in each cluster, and topics that are difficult to identify as unique. While some combinations of words like “really did climb” or “main summit hour” appear somewhat unique, the noise from “mr farmer” suggests either too few documents in this period to be useful, or the over-abundant documentation of a single expeditioner (most likely the former). Evaluating other periods gave similar results. The results do not provide a clear sense of each group.
We must consider whether noise from grammatical particles is throwing off our topics. One way to test this is to remove words whose part of speech carries little meaning. Using spaCy, we can determine whether a word is a noun, verb, etc., and keep only the parts of speech that carry meaning. When we do this, we produce the following clusters from the trigrams of the exploration period:

While the noise is cut down, we observe that without the grammatical particles the topics are no better (we conclude this objectively) than those from the above clusters with “noise”. They are easier for a human to interpret, but do not make a better representation of each group. Using the same approach for the “early” period we get cleaner, but not better, results:

Neither approach, using parts of speech only, or the entire corpus, shows differentiated topics for each cluster.

Term frequency is not the only way to approach topic modeling. There are several different topic modeling algorithms; another method is Latent Dirichlet Allocation (LDA). LDA works by analyzing the words in the documents and grouping them into topics based on how frequently certain words are used together. It assumes that each document is made up of a mixture of different topics, and that each topic is made up of a set of related words. LDA is a powerful tool for analyzing text, but it is important to remember that it is just a model and the topics that are identified by LDA may not be the same as the topics that a human would identify. However, LDA can still be a useful tool for understanding the content of a corpus of text (Jelodar et al., 2019).

Applying LDA to the exploration period does not produce topics that suggest a clear separation of content. Here we evaluate the 3 most frequent words:

When ranking the terms of an LDA topic, a parameter lambda allows us to control how words are selected. Values of lambda very close to zero favor terms that are more specific to a chosen topic. This means that we will see terms that are important locally, for a specific topic or document, but not globally. When we lower lambda on our exploratory period we get:
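The effect of lambda can be illustrated with the relevance measure used by the LDAvis family of tools, where relevance = lambda * log p(term | topic) + (1 - lambda) * log(p(term | topic) / p(term)). That the lambda discussed above is this relevance weight is an assumption on our part, and the probabilities below are invented for illustration.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """LDAvis-style relevance of a term to a topic for a given lambda in [0, 1]."""
    return (lam * math.log(p_term_given_topic)
            + (1.0 - lam) * math.log(p_term_given_topic / p_term))


def rank_terms(topic_term_probs, corpus_term_probs, lam, top_n=3):
    """Return the top_n terms of a topic ordered by relevance."""
    scored = {term: relevance(p, corpus_term_probs[term], lam)
              for term, p in topic_term_probs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Invented probabilities: "camp" is common everywhere, "farmer" is rare
    # overall but concentrated in this topic.
    topic = {"camp": 0.10, "farmer": 0.05}
    corpus = {"camp": 0.10, "farmer": 0.005}
    print(rank_terms(topic, corpus, lam=1.0))
    print(rank_terms(topic, corpus, lam=0.0))
```

With lambda at 1 the ranking follows raw within-topic frequency; as lambda approaches 0 it favors rare terms concentrated in one topic, which matches the highly specific output described above.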

The returned values become very document specific and cannot be interpreted globally. Additionally, the topics provide little insight to important period characteristics. When we apply the same method to the global trigram clusters, our results are equally poor:

Once again, reducing lambda does not improve the topics:

Attempts using the clusters without grammatical particles are also nonsensical and uninterpretable.
Our final attempt to garner results uses a method entirely different from those tried to date. ChatIntents is a package that automatically clusters short text documents and applies descriptive group labels to them, which makes it well suited to our corpus (Borrelli, 2021). Using this on the all-period trigrams we can assess each cluster:

Using unsupervised methods to discover topics within clusters or periods produces unintuitive, uninterpretable results. Clusters and periods surface words that are either globally related to climbing or so document specific that they have little meaning for a subgroup of the data. Directions, camps, ropes, summits, and snow are expected content in a mountain climbing expedition memo. Our results offer little to suggest that the documents contain interesting and informative content that is not already described by the structured data. It is interesting, however, that Everest was separated as a topic. We observe overlap here with the results of the social-media analysis.

Ground Truth Labels

Human-created ground truth labels are objectively better than unsupervised attempts to produce labels (e.g., topics) because they are more accurate and reliable (Odumuyiwa et al., 2022). Human annotators can understand the context of the data and apply their knowledge to label it correctly. In a final attempt to produce labels that could be used to identify a topic shift per period, we assess how humans label topics from a sample of memos. Fifty memos were sampled at random from the corpus. From the memos, topics were developed and criteria set for the evaluation of each topic type (see appendix).

These topics were then applied to the 50 memos. Two volunteers were also asked to read each memo and apply one of the topic labels to it, blinded to the labels assigned by others. The labels were then evaluated using Krippendorff’s alpha to determine their reliability (Krippendorff, 2004).
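The reliability calculation described above can be sketched as follows. This is a minimal, dependency-free illustration of Krippendorff's alpha for nominal labels built from a coincidence matrix; it is not the code used for this report. The function name and the toy ratings are hypothetical, and the ratings are invented purely to show the input shape (one list of labels per memo, one label per rater).

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one entry per memo; each entry is the list of labels the
    raters gave that memo (None marks a missing label).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # memo by different raters, weighted by 1 / (labels in memo - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a memo with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one memo with two or more labels")

    # Observed and expected disagreement (nominal metric: 0 if equal, else 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy ratings (hypothetical): three raters labelling four memos.
toy_ratings = [
    ["Bad conditions", "Bad conditions", "Bad conditions"],
    ["Factual Route Description", "Factual Route Description", "Bad conditions"],
    ["Injuries/Accidents/Death", "Injuries/Accidents/Death", None],
    ["Factual Route Description", "Bad conditions", "Factual Route Description"],
]
print(round(krippendorff_alpha_nominal(toy_ratings), 3))
```

An alpha of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement.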

The resulting alpha for this label set was 0.63, which is below the 0.667 minimum that Krippendorff (2004) suggests for drawing even tentative conclusions, so the labels cannot be considered reliable. Labels were reassessed to reduce overlap and focus on specific content. This reduced the label set to “Bad conditions”, “Factual Route Description”, and “Injuries/Accidents/Death” (see appendix for the full label set). Using only three labels would likely increase inter-rater reliability, but the labels would be too general to determine topic shift between periods.

The sample memos were then evaluated for the word choices that best represented each memo. The task was to use human interpretation to produce three words that best described each memo. The results of this attempt were too document-specific, and the words and labels generated did not apply to other memos. From these attempts it was determined that it was not feasible to apply labels to the memo content and create distinguishable categories.

From this analysis we found that clustering by climbing period (exploration, expedition, transitional, commercial and social-media) did not produce high-quality, well-separated clusters. Instead, clusters appeared to be well formed when taking input from the entire corpus. The clusters we saw did not divide the data into specific periods; most clusters spanned multiple periods, suggesting that content is not time-specific. We expected memos of similar writing style, vernacular, or content (e.g., injuries, death, success) to be grouped together. If this content varied by period, the expeditions in each cluster would reflect it, with early expeditions clustering together and later expeditions clustering together. What we saw was that content could be differentiated, but the year of the expedition was not a distinguishing factor, suggesting that content has not changed over time.
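One way to put a number on how strongly cluster membership tracks climbing period is normalized mutual information (NMI) between the two labelings. The sketch below is an illustration only and is not part of the analysis reported here; the function name, the cluster ids, and the toy data are hypothetical. Values near 0 mean that knowing a memo's cluster says little about its period, and values near 1 mean that clusters and periods coincide.

```python
import math
from collections import Counter


def normalized_mutual_information(clusters, periods):
    """NMI between two labelings of the same memos (0 = unrelated, 1 = identical)."""
    if len(clusters) != len(periods):
        raise ValueError("both labelings must cover the same memos")
    n = len(clusters)
    joint = Counter(zip(clusters, periods))
    cluster_counts = Counter(clusters)
    period_counts = Counter(periods)

    def entropy(counts):
        return -sum((k / n) * math.log(k / n) for k in counts.values())

    # Mutual information from the joint and marginal frequencies.
    mutual_info = sum(
        (k / n) * math.log(k * n / (cluster_counts[c] * period_counts[p]))
        for (c, p), k in joint.items()
    )
    h_clusters, h_periods = entropy(cluster_counts), entropy(period_counts)
    if h_clusters == 0 or h_periods == 0:
        return 0.0  # a labeling with a single group carries no information
    return mutual_info / math.sqrt(h_clusters * h_periods)


# Toy data (hypothetical): cluster ids for six memos and their periods.
toy_clusters = [0, 0, 1, 1, 2, 2]
toy_periods = ["exploration", "commercial", "expedition",
               "commercial", "exploration", "expedition"]
print(round(normalized_mutual_information(toy_clusters, toy_periods), 3))
```

A low score on real cluster assignments would be consistent with the observation that most clusters span multiple periods.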

We also attempted to isolate topics from the clusters and periods in the route memos. Across our various attempts to identify topics, through unsupervised methods as well as human intervention, we did not pass reliability thresholds or create topics that articulated distinct characteristics of each group. Topics were similar in content and language, which mirrors the results from clustering: the content falls into categories, but these categories are similar and do not provide insight into the methods, actions, or history of climbing.

When we reflect on the memo content, the results are intuitive. Memos are written for future climbers, cataloging important route characteristics and the stories of those who may be the first to discover a path that will be traveled for years to come. If we consider the number of first ascents over time, we see that as many first ascents have occurred in the past 10 years as occurred 25 to 35 years ago.

This suggests that the drivers to undertake these dangerous climbs have not changed with time. Technology, gear, and information may allow newcomers to stand at the top of the world. However, hard-core mountaineering experts continue to explore the region with the same intent, difficulties, and desire as those who came before them.

Topic Trends, Route Memos, and What They Tell Us About Climbers

The Rise of Topic Trends and the Exposure to Danger

Intent Discovery Using BERTopic with Individually Tuned Components

SECTION 1 Deriving Topics from Full Dataset with Unsupervised Learning

Tune BERTopic Pipeline Components

Automating Tuning of Parameters

Evaluation of the Pipeline and Loss Explanation

Semantic Evaluation of Topic Labels and Documents

Topic Trends in the Full Dataset

Intent Discovery

SECTION 2 Evaluation of Unsupervised Learning with Manual Labeling

BERTopic with Tuned Components in the Train Set

Semantic Evaluation of Topic Labels and Documents

Topic Trends in Train Dataset

Evaluation of Model

Like a Rock: The Steady, Unchanged Climbers

Evaluating Clustering as a Concept

Pre-processing

Broader Hyper-Parameter Tuning with Grid Search

Topic Modeling

Ground Truth Labels

Statement of Work (April 2023)

Simi Talkar

Brian Seko

Matthieu Lienart