Visual question answering based on local-scene-aware referring expression generation. (July 2021)