This data set contains the MeSH tags of 7470 cases, 2696 of which are normal. In a MeSH tag, the root words and the later words carry different kinds of information: the root words set the topic of the full sentence, while the later words are qualifiers that describe the situation. The full data set is shown at the end of the post.
The goal here is to create word embeddings for the root words such that similar words end up closer together. Applying the word2vec algorithm in gensim gave very poor results on this data set, so we tried the Hellinger PCA method instead, which performs a dimension reduction on the co-occurrence matrix.
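As a sketch of the idea (the function and variable names below are mine, not from any library): Hellinger PCA row-normalizes the co-occurrence matrix into conditional probability distributions, takes element-wise square roots, and then applies ordinary PCA, since Euclidean distance between the square-rooted rows is proportional to Hellinger distance between the distributions.

```python
import numpy as np

def hellinger_pca(cooc, dim):
    """Embed the rows of a co-occurrence matrix with Hellinger PCA.

    cooc: (n_words, n_contexts) array of co-occurrence counts.
    dim:  target embedding dimension.
    """
    # Row-normalize so each row is P(context | word).
    probs = cooc / cooc.sum(axis=1, keepdims=True)
    # The Hellinger distance between two rows is, up to a constant,
    # the Euclidean distance between their element-wise square roots,
    # so plain PCA on sqrt(P) performs the Hellinger reduction.
    roots = np.sqrt(probs)
    centered = roots - roots.mean(axis=0)
    # PCA via SVD: project onto the top `dim` right singular vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T
```

This is only a sketch under the assumption that every dictionary word has at least one co-occurrence (otherwise the row normalization divides by zero).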
The co-occurrence matrix encodes the conditional probability of word co-occurrence. The text (corpus) is first used to form dictionary words and context words. The co-occurrence matrix has dictionary words as rows and context words as columns.
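A minimal way to build such a matrix, counting sentence-level co-occurrences (the helper name and the toy MeSH-like sentences in the usage note are placeholders of mine, not the exact pipeline used here):

```python
import numpy as np

def cooccurrence_matrix(sentences, dictionary, contexts):
    """Count how often each context word appears in the same
    sentence as each dictionary word.

    sentences:  iterable of tokenized sentences (lists of words).
    dictionary: words that get embeddings (matrix rows).
    contexts:   words used as raw features (matrix columns).
    """
    row = {w: i for i, w in enumerate(dictionary)}
    col = {w: j for j, w in enumerate(contexts)}
    mat = np.zeros((len(dictionary), len(contexts)))
    for sent in sentences:
        for w in sent:
            if w not in row:
                continue
            for c in sent:
                # Skip self-counts when a word is both a row and a column.
                if c in col and c != w:
                    mat[row[w], col[c]] += 1
    return mat
```

For example, `cooccurrence_matrix([["hypertension", "pulmonary"]], ["hypertension"], ["pulmonary"])` yields a 1×1 matrix with a single count.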
In this data set, there are 1679 unique MeSH sentences, 178 unique words and 118 unique root words.
There are 23 words that appear only once and 12 words that appear only twice in the data set, as seen in Table 1. Of the 23 one-appearance words, 11 show up as one-word sentences. We remove these and arrive at a dictionary of 167 words; the text then contains 1668 sentences.
| one appearance | two appearances |
| --- | --- |
| technical quality of image unsatisfactory | hyperostosis, diffuse idiopathic skeletal |
| cystic fibrosis | cardiophrenic angle |
| pectus carinatum | bone and bones |
| hypertension, pulmonary | hernia, diaphragmatic |
| pulmonary disease, chronic obstructive | multilobar |
| expansile bone lesions | |
Table 1. Words that only appear once or twice in the data set. The words in bold font are stand-alone sentences and are removed from the dictionary.
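The pruning described above can be sketched as follows, assuming the corpus is a list of tokenized sentences (the function name is mine):

```python
from collections import Counter

def prune_rare_singletons(sentences):
    """Drop one-word sentences whose only word occurs exactly once
    in the whole corpus (the bold entries in Table 1)."""
    counts = Counter(w for sent in sentences for w in sent)
    return [s for s in sentences
            if not (len(s) == 1 and counts[s[0]] == 1)]
```

Removing these sentences removes their words from the corpus entirely, which is why both the dictionary size and the sentence count drop by 11.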
construct co-occurrence matrix
The dictionary words are the ones that receive word embeddings, and we choose the root words of the MeSH sentences as dictionary words. After removing the 11 stand-alone words above, there are 107 of them.
The context words are essentially raw features. In this data set, there are 87 qualifier words with a total of 4413 appearances. The common way to pick context words is to take the top 10% or 20% most frequently occurring words, but I don't think that works here. The top 30 qualifier words in this data set are shown in Table 2; they account for 3806 appearances.
Table 2. Most frequently occurring qualifier words.
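For reference, the usual top-fraction selection mentioned above can be written as a few lines (a sketch; the flat list of qualifier occurrences is a placeholder input of mine):

```python
from collections import Counter

def top_context_words(qualifier_occurrences, fraction=0.2):
    """Return the most frequent `fraction` of distinct qualifier words.

    qualifier_occurrences: flat list of qualifier tokens, one entry
    per appearance in the corpus.
    """
    counts = Counter(qualifier_occurrences)
    k = max(1, int(len(counts) * fraction))
    return [w for w, _ in counts.most_common(k)]
```

The objection in the text is not to this mechanism but to what the frequent qualifiers mean, as Table 3 shows.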
Many of these qualifier words describe specific features, as shown in Table 3, or refer to an organ. If they were used to define the dictionary words, closeness between dictionary words would merely indicate similarity in left-right symmetry and the like.
Even after PPMI reweighting, the word embedding still does not look good under t-SNE: most points are close together and their distances do not seem to make much sense.
| feature | context words |
| --- | --- |
| left-right | left, right, bilateral |
| distribution | patchy, focal, streaky, diffuse, scattered |
| severity | mild, severe, moderate, borderline |
Table 3. Context words that describe a specific feature.
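For completeness, the PPMI reweighting mentioned above can be sketched as follows (the function name is mine; this is the standard definition, not necessarily the exact variant used here):

```python
import numpy as np

def ppmi(cooc):
    """Positive pointwise mutual information of a count matrix.

    PPMI(w, c) = max(0, log( P(w, c) / (P(w) P(c)) )),
    which up-weights pairs that co-occur more than chance predicts.
    """
    total = cooc.sum()
    pw = cooc.sum(axis=1, keepdims=True) / total   # P(w)
    pc = cooc.sum(axis=0, keepdims=True) / total   # P(c)
    pwc = cooc / total                             # P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pwc / (pw * pc))
    # Zero counts give -inf (or nan); PPMI clips them to zero anyway.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)
```

The reweighted matrix can then be fed to the Hellinger PCA step in place of the raw counts.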
So far the embedding does not work. Maybe the feature-describing context words of Table 3 should be removed first? Or maybe the root words alone should not be the target words. Or is the data set simply too small?
appendix: the full data set
The full data set is shown below. The ‘root’ node is only for display purposes. All words in blue boxes except ‘root’ are root words of the MeSH sentences; the qualifiers are in orange boxes. A blank orange box means the root word can be used alone.