This data set contains the MeSH tags of 7470 cases, 2696 of which are normal. In a MeSH tag, the root words and the later words carry different kinds of information: the root words set the topic of the full sentence, while the later words are qualifiers that describe the situation. The full data set is shown at the end of the post.
The goal here is to create word embeddings for the root words such that similar words end up closer together. Applying the word2vec algorithm in gensim gave very poor results on this data set, so we tried the Hellinger PCA method instead, which performs a dimension reduction on the co-occurrence matrix.
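As a sketch of the idea (the function and variable names below are mine, not from any library): Hellinger PCA row-normalizes the co-occurrence matrix into conditional probability distributions, takes element-wise square roots, and then applies ordinary PCA, since Euclidean distance between the square-rooted rows is proportional to Hellinger distance between the distributions.

```python
import numpy as np

def hellinger_pca(cooc, dim):
    """Embed the rows of a co-occurrence matrix with Hellinger PCA.

    cooc: (n_words, n_contexts) array of co-occurrence counts.
    dim:  target embedding dimension.
    """
    # Row-normalize so each row is P(context | word).
    probs = cooc / cooc.sum(axis=1, keepdims=True)
    # The Hellinger distance between two rows is, up to a constant,
    # the Euclidean distance between their element-wise square roots,
    # so plain PCA on sqrt(P) performs the Hellinger reduction.
    roots = np.sqrt(probs)
    centered = roots - roots.mean(axis=0)
    # PCA via SVD: project onto the top `dim` right singular vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T
```

This is only a sketch under the assumption that every dictionary word has at least one co-occurrence (otherwise the row normalization divides by zero).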
The co-occurrence matrix encodes the conditional probability of word co-occurrence. The text (corpus) is first used to form dictionary words and context words. The co-occurrence matrix has dictionary words as rows and context words as columns.
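A minimal way to build such a matrix, counting sentence-level co-occurrences (the helper name and the toy MeSH-like sentences in the usage note are placeholders of mine, not the exact pipeline used here):

```python
import numpy as np

def cooccurrence_matrix(sentences, dictionary, contexts):
    """Count how often each context word appears in the same
    sentence as each dictionary word.

    sentences:  iterable of tokenized sentences (lists of words).
    dictionary: words that get embeddings (matrix rows).
    contexts:   words used as raw features (matrix columns).
    """
    row = {w: i for i, w in enumerate(dictionary)}
    col = {w: j for j, w in enumerate(contexts)}
    mat = np.zeros((len(dictionary), len(contexts)))
    for sent in sentences:
        for w in sent:
            if w not in row:
                continue
            for c in sent:
                # Skip self-counts when a word is both a row and a column.
                if c in col and c != w:
                    mat[row[w], col[c]] += 1
    return mat
```

For example, `cooccurrence_matrix([["hypertension", "pulmonary"]], ["hypertension"], ["pulmonary"])` yields a 1×1 matrix with a single count.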
In this data set, there are 1679 unique MeSH sentences, 178 unique words and 118 unique root words.
There are 23 words that appear only once and 12 words that appear only twice in the data set, as seen in Table 1. Of the 23 one-appearance words, 11 show up as one-word sentences. We remove these and arrive at a dictionary of 167 words; the text then contains 1668 sentences.
| one appearance | two appearances |
| --- | --- |
| technical quality of image unsatisfactory | hyperostosis, diffuse idiopathic skeletal |
| cystic fibrosis | cardiophrenic angle |
| pectus carinatum | bone and bones |
| hypertension, pulmonary | hernia, diaphragmatic |
| pulmonary disease, chronic obstructive | multilobar |
| expansile bone lesions | |
Table 1. Words that only appear once or twice in the data set. The words in bold font are stand-alone sentences and are removed from the dictionary.
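The pruning described above can be sketched as follows, assuming the corpus is a list of tokenized sentences (the function name is mine):

```python
from collections import Counter

def prune_rare_singletons(sentences):
    """Drop one-word sentences whose only word occurs exactly once
    in the whole corpus (the bold entries in Table 1)."""
    counts = Counter(w for sent in sentences for w in sent)
    return [s for s in sentences
            if not (len(s) == 1 and counts[s[0]] == 1)]
```

Removing these sentences removes their words from the corpus entirely, which is why both the dictionary size and the sentence count drop by 11.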
construct co-occurrence matrix
The dictionary words are the ones that receive word embeddings, and we choose the root words of the MeSH sentences as dictionary words. After removing the 11 stand-alone words above, there are 107 of them.
The context words are essentially raw features. In this data set, there are 87 qualifier words with a total of 4413 appearances. The common way to pick context words is to take the top 10% or 20% most frequently occurring words, but I don't think that works here. The top 30 qualifier words in this data set are shown in Table 2; they account for 3806 appearances.
Table 2. Most frequently occurring qualifier words.
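For reference, the usual top-fraction selection mentioned above can be written as a few lines (a sketch; the flat list of qualifier occurrences is a placeholder input of mine):

```python
from collections import Counter

def top_context_words(qualifier_occurrences, fraction=0.2):
    """Return the most frequent `fraction` of distinct qualifier words.

    qualifier_occurrences: flat list of qualifier tokens, one entry
    per appearance in the corpus.
    """
    counts = Counter(qualifier_occurrences)
    k = max(1, int(len(counts) * fraction))
    return [w for w, _ in counts.most_common(k)]
```

The objection in the text is not to this mechanism but to what the frequent qualifiers mean, as Table 3 shows.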
Many of these qualifier words describe specific features, as shown in Table 3, or refer to an organ. If they were used to define the dictionary words, closeness between dictionary words would merely indicate similarity in left-right symmetry and the like.
Even after PPMI reweighting, the word embedding still does not look good under t-SNE: most points are close together and their distances do not seem to make much sense.
| feature | context words |
| --- | --- |
| left-right | left, right, bilateral |
| distribution | patchy, focal, streaky, diffuse, scattered |
| severity | mild, severe, moderate, borderline |
Table 3. Context words that describe a specific feature.
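For completeness, the PPMI reweighting mentioned above can be sketched as follows (the function name is mine; this is the standard definition, not necessarily the exact variant used here):

```python
import numpy as np

def ppmi(cooc):
    """Positive pointwise mutual information of a count matrix.

    PPMI(w, c) = max(0, log( P(w, c) / (P(w) P(c)) )),
    which up-weights pairs that co-occur more than chance predicts.
    """
    total = cooc.sum()
    pw = cooc.sum(axis=1, keepdims=True) / total   # P(w)
    pc = cooc.sum(axis=0, keepdims=True) / total   # P(c)
    pwc = cooc / total                             # P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pwc / (pw * pc))
    # Zero counts give -inf (or nan); PPMI clips them to zero anyway.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)
```

The reweighted matrix can then be fed to the Hellinger PCA step in place of the raw counts.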
So far the embedding does not work. Maybe the feature-describing context words of Table 3 should be removed first? Or maybe the root words alone should not be the target words. Or is the data set simply too small?
appendix: the full data set
The full data set is shown below. The ‘root’ node is only for display purposes. All words in blue boxes except ‘root’ are root words of the MeSH sentences; the qualifiers are in orange boxes. A blank orange box means the root word can be used alone.