In recent years, a great deal of interest in research of Visual Question Answering (VQA) has been propounded it as a hot topic in computer vision. Many sub-problems were raised in this regard, and reasonable efforts have been made to solve them. Considering salient elements of different modalities, discovering inter or intra correlation, proper information fusion method, using supplementary information of external knowledge bases, visual reasoning, and accepting correct answers that have not been seen before in the training set are examples of these issues. In this paper, the focus is on reinforcing the model by reasoning about complex questions, applying the attention mechanism, and leveraging knowledge graphs (KG) to improve the generated answers. Moreover, the proposed conceptual model includes a zero-shot learning method to allow unlabeled correct answers by implementing a semantic space mapping approach. The use of the fact-based VQA knowledge base for integrating the scene graph with additional information is suggested in the research. It is expected that based on the proposed approach of the framework, its implementation will lead to better accuracy and improvement in efficiency for predicting the appropriate answers.