Bouraoui et al. (2020) - Inducing Relational Knowledge from BERT - They distill relational
knowledge from a PLM by using sentences with related terms as templates. At one point they say,
“even if language models capture relational knowledge, it is important to find the right sentences to extract that knowledge.” I’m not sure I agree that minor perturbations in prompts causing the facade of knowledge to collapse are merely an obstacle to overcome. I think it is rather indicative of the PLM not actually possessing relational knowledge, but instead possessing some sort of vague association between words in a sentence. For example, capital-of seems easy for PLMs, but that is likely because such pairs are often explicitly expressed, either as “X is the capital of Y” or simply as “X, Y” or the like, i.e. the two terms frequently co-occur. This is corroborated by the is-colour relations never working: colour is a trait that is rarely stated explicitly unless it is specific to a given instance of an item. Actually, isn’t the use of these templates putting the cart before the horse? Taking “Paris is located in central France” and replacing Paris with London and France with England would be a better test of whether the PLM has actually learnt relational knowledge: you would want “London is the capital of England” to score high and “London is located in central England” to score low, but if the model has just learnt to associate London and England, then you’d expect similar scores.
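To make that concrete, here is a minimal sketch of the substitution test, assuming the HuggingFace transformers library and bert-base-uncased; the sentence pair and the scoring method (pseudo-log-likelihood) are my choices for illustration, not the paper's protocol:

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def pseudo_log_likelihood(sentence: str) -> float:
        """Score a sentence by masking each token in turn and summing the
        log-probability BERT assigns to the true token."""
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        total = 0.0
        for i in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        return total

    # If BERT has genuinely learnt capital-of, the first sentence should score
    # clearly higher; if it has merely learnt to associate London and England,
    # the two scores should be similar.
    for s in ["London is the capital of England.",
              "London is located in central England."]:
        print(s, pseudo_log_likelihood(s))
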
They then train a classifier (also updating the weights of BERT) to predict whether an assertion is true or not. They create negative samples by scrambling real samples and by reversing them. This seems awfully circular: you used BERT to come up with templates, and then you use these templates to test whether BERT has learnt anything about these relations by training BERT on these templates. A strong hmmm. I also wonder whether a classifier that predicts whether a pair is related based on a given template can only give a shallow indication of relational knowledge being encoded in a PLM. Maybe it would be better to try to predict the template given the pair, e.g. “Paris is the [MASK] of France”? That might work for BERT but not for other PLMs.
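As a hedged sketch of that alternative probe (again assuming the transformers library; this is my illustration, not something the paper does, and it only works when the relation word is a single wordpiece):

    from transformers import pipeline

    # Ask BERT to fill in the relation word directly, given the entity pair.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    for candidate in fill("Paris is the [MASK] of France.", top_k=5):
        print(candidate["token_str"], round(candidate["score"], 3))

If “capital” ranks highly here, that is at least weak evidence that the relation word itself, and not just the Paris–France association, is accessible.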
They conclude, “This shows that the BERT language model indeed captures commonsense and factual knowledge to a greater extent than word vectors, and that such knowledge can be extracted from these models in a fully automated way.”
I’m not sure I agree with this conclusion. A more defensible claim would be that, in contexts where relational information is encoded, BERT can be fine-tuned to predict semantic relations based on these prompts.
References
Bouraoui, Z., Camacho-Collados, J., and Schockaert, S. 2020. Inducing relational knowledge from BERT. Proceedings of the AAAI Conference on Artificial Intelligence 34, 5, 7456–7463.