Xiao Lin: Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks. Virginia Tech, Blacksburg, VA, USA, 2017