1. Prerequisites
1.1 Knowledge Graph

Figure 1. Examples of knowledge graphs from different categories, i.e., encyclopedic KGs, commonsense KGs, domain-specific KGs, and multi-modal KGs. (from ref. 1)
A knowledge graph (KG) [ref. 1, 2, 3] is a widely applied knowledge representation in real-world applications, known for organizing facts, i.e., entities and the semantic relations between them, in a structured and interpretable way. KGs store structured knowledge as a collection of triples, KG = {(h, r, t)} ⊆ E × R × E, where E and R denote the set of entities and the set of relations, respectively.
As shown in Figure 1, knowledge graphs can generally be grouped into four categories [ref. 1]:
Encyclopedic Knowledge Graph: It stores general knowledge of the real world, characterized by diverse and extensive information sources.
Commonsense Knowledge Graph: It formulates the knowledge about daily concepts.
Domain-specific Knowledge Graph: It represents knowledge in specific domains, such as medicine, biology, and finance.
Multi-modal Knowledge Graph: Besides textual information, it also includes facts in other modalities such as images, sounds, and videos.
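The triple-based representation above can be sketched in a few lines of Python; the facts, entity names, and relation names below are purely illustrative, not drawn from any real KG.

```python
# A knowledge graph as a set of (head, relation, tail) triples.
triples = {
    ("Albert Einstein", "born_in", "Ulm"),
    ("Albert Einstein", "field", "Physics"),
    ("Ulm", "located_in", "Germany"),
}

# The entity set E and relation set R fall out of the triples directly.
entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
relations = {r for _, r, _ in triples}

print(sorted(relations))  # → ['born_in', 'field', 'located_in']
```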

Figure 2. KG Research Topics (from ref. 2)
Fig. 2 illustrates the general research topics and tasks around KGs [ref. 2]:
Knowledge Graph Embeddings: It aims to map each entity and relation in a KG into a low-dimensional continuous vector space that efficiently captures the semantics and the structure of the knowledge graph. The learned embeddings can be applied in various downstream applications, such as link/relation prediction, entity prediction, and semantic search.
Knowledge Graph Completion: It aims to infer the missing information in a given knowledge graph.
Knowledge Graph Construction/Knowledge Acquisition: It aims to develop a structured representation of knowledge within a specific domain by identifying entities and their relationships. It typically involves three stages: entity discovery, coreference resolution, and relation extraction.
Knowledge Graph Fusion: When knowledge comes from multiple data sources, entity alignment is crucial for fusing them into a single coherent knowledge graph.
Knowledge Reasoning: It aims to infer new facts based on existing knowledge to enrich the knowledge graph. Specifically, it derives relationships between previously unconnected entities, and it can also identify erroneous knowledge by reasoning out false facts. (Note: reasoning here means logical inference over the graph structure, which differs from the free-form reasoning of LLMs.)
AI Systems: KGs have been widely applied in recommender systems, question-answering systems, and information retrieval tools to improve their performance.
Knowledge Graph Applications: KGs are widely applied across diverse fields, such as education, scientific research, social media, and healthcare.
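As a concrete illustration of the KG-embedding topic above, the sketch below implements the scoring idea of TransE, one popular embedding model, in which a triple (h, r, t) is plausible when h + r ≈ t in vector space. The embeddings here are random and untrained, so the scores only demonstrate the mechanics, not meaningful predictions; the entity names and dimension are illustrative assumptions.

```python
import numpy as np

# TransE-style scoring sketch: each entity and relation gets a vector,
# and a triple (h, r, t) is scored by how well h + r approximates t.
rng = np.random.default_rng(seed=0)
dim = 16
names = ["Paris", "Berlin", "France", "capital_of"]
emb = {name: rng.normal(size=dim) for name in names}

def transe_score(h, r, t):
    # Lower distance = more plausible under the h + r ≈ t assumption.
    return float(np.linalg.norm(emb[h] + emb[r] - emb[t]))

s = transe_score("Paris", "capital_of", "France")
print(f"score(Paris, capital_of, France) = {s:.3f}")
```

In a real system these vectors would be trained so that observed triples score lower than corrupted ones; the learned embeddings then feed link prediction and semantic search.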
1.2 LLMs and KGs
Figure 3. Pros and cons of LLMs and KGs (from ref. 1)
Despite their success in many applications, LLMs have been criticized for their lack of factual accuracy.
Lacking Domain-specific/New Knowledge: LLMs can only be as good as their training data, and the factual knowledge memorized by LLMs is limited by the training corpus.
Hallucination: LLMs are not always reliable in recalling knowledge accurately and may produce hallucinated information.
Implicit Knowledge/Black-box: The knowledge is encoded implicitly within model parameters, functioning as a black box, which lacks transparency for interpreting and validating the knowledge derived from LLMs.
Indecisiveness: LLMs perform reasoning based on probability models, which can be indecisive, and the reasoning patterns used to reach decisions are not directly accessible or explainable to humans.
KGs offer a potential solution, helping mitigate the drawbacks of LLMs mentioned above.
Structural Knowledge: Knowledge or facts are stored explicitly as triplets in KGs.
Accuracy: KGs act as a reliable reference for verifying knowledge and facts.
Decisiveness: The factual reasoning is straightforward and decisive in KGs.
Interpretability: Decision-making patterns are accessible and explainable via subgraphs in KGs.
Domain-specific Knowledge: Domain-specific knowledge can be easily encoded with KGs.
Evolving Knowledge: KGs can be efficiently maintained and updated using Create, Read, Update, and Delete (CRUD) operations, ensuring the evolution of encoded knowledge.
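The CRUD-style maintenance mentioned above can be sketched as a tiny in-memory triple store; a real deployment would use a graph database, so this is only a toy stand-in for the idea.

```python
# A toy in-memory triple store supporting CRUD operations, illustrating
# how a KG's encoded knowledge can evolve over time.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def create(self, h, r, t):
        self.triples.add((h, r, t))

    def read(self, h=None, r=None, t=None):
        # None acts as a wildcard, so read(h="Pluto") returns all facts about Pluto.
        return {(a, b, c) for (a, b, c) in self.triples
                if (h is None or a == h)
                and (r is None or b == r)
                and (t is None or c == t)}

    def update(self, old, new):
        self.triples.discard(old)
        self.triples.add(new)

    def delete(self, h, r, t):
        self.triples.discard((h, r, t))

kg = TripleStore()
kg.create("Pluto", "type", "planet")
# Knowledge evolves: Pluto was reclassified as a dwarf planet in 2006.
kg.update(("Pluto", "type", "planet"), ("Pluto", "type", "dwarf_planet"))
print(kg.read(h="Pluto"))  # → {('Pluto', 'type', 'dwarf_planet')}
```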
However, at the same time, KGs have their own challenges.
Incompleteness: KGs are difficult to construct, and handling the incomplete and dynamically changing real-world knowledge is challenging.
Unseen Facts: KGs are customized for specific tasks, but their ability to generalize to unseen facts is often limited. This constraint arises because KGs rely on explicitly encoded triplets, which may not comprehensively cover new scenarios.
Lack of Language Understanding: KG-based methods often ignore the abundant textual information associated with entities and relations.

Figure 4. LLMs Meet KGs (from ref. 1)
Building on the complementary strengths and weaknesses above, Fig. 4 demonstrates how LLMs and KGs can be integrated with each other to improve existing LLM and KG tasks, covering KG-enhanced LLMs, LLM-augmented KGs, and synergized LLMs+KGs.

Figure 5. Unifying LLMs and KGs
Fig. 5 illustrates how those three categories work:
KG-enhanced LLMs: KGs are utilized for LLM tasks by providing a structured format of the knowledge. They can be applied during either the pretraining or the inference stage.
LLM-augmented KGs: LLMs are utilized for KG tasks by offering advanced abilities of natural language understanding and semantic embedding encoding. This enables the automatic and efficient construction and completion of KGs.
Synergized LLMs+KGs: LLMs and KGs can be synergized to tackle more complex tasks that encompass both LLM and KG task objectives.
2. KG Utilizations
2.1 Synthetic Data Pipeline
The main objective of a synthetic data pipeline (SDP) is to synthesize data examples that resemble real-world distributions. However, since real-world distributions cannot be estimated directly, the focus is on producing diverse and high-quality examples. Our current method involves using a fixed set of prompts to extract knowledge embedded in LLMs. As noted earlier, the knowledge derived from LLMs may be unreliable and biased. To enhance the specificity of the generated examples, prompts are enriched with data seeds [ref. 4, 5], which provide additional context and inspiration for generation. Nonetheless, controlling the level of detail remains challenging, as the data seeds are expressed in unstructured natural language, limiting control over which facts are incorporated during generation.
To address these issues, KGs can be employed. The main idea is to make the knowledge extraction process more independent from the generative LLM, leveraging LLMs solely for their KG-to-text capabilities. Unlike the previous implicit approach, the earlier steps focus entirely on explicit KG construction and completion before the final generation phase. This approach enhances the controllability, interpretability, and evaluation of the generation process.
Controllability: The complexity and specificity of the generated example can be controlled by the complexity of the constructed KG, such as the number of nodes and edges or the edge-to-node ratio.
Interpretability: During the graph construction and completion, it is straightforward to check integrated facts from data seeds by examining the graph structural changes.
Evaluation: The generated example can be transformed back into a KG to verify consistency between the original and reconstructed versions.
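A minimal sketch of the KG-to-text generation step described above: the explicitly constructed KG is serialized into a prompt so the LLM is used only for verbalization rather than as the source of facts. The prompt wording and the seed triples are illustrative assumptions, and the actual LLM call is left out.

```python
# Serialize an explicitly constructed KG into a generation prompt, so the
# LLM is used only for its KG-to-text capability.
def triples_to_prompt(triples):
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in sorted(triples))
    return ("Write one fluent paragraph that expresses exactly the "
            "following facts, and no others:\n" + facts)

seed_kg = {
    ("ACME Corp", "headquartered_in", "Berlin"),
    ("ACME Corp", "founded_in", "1998"),
}
prompt = triples_to_prompt(seed_kg)
print(prompt)
```

For the evaluation step, the generated paragraph would be parsed back into triples (text-to-KG) and compared against `seed_kg` for consistency.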
2.2 Policy Compliance Verification
KGs have been applied in policy document analysis [ref. 6, 7] to represent policy statements as graphs. The proof-of-concept prototype presented in [ref. 8] shows that once the taxonomy of entities and relations for regulated activities is defined and the formal framework is built, first-order logic can be applied, enabling compliance verification algorithms.
Compared to our current approach, which uses an LLM as a classifier, KGs can potentially offer:
Enhanced Verification Accuracy: The compliance verification results are obtained from deterministic and strict graph algorithms.
Enhanced Reasoning Accuracy: The reasoning is based on the observations from KGs, such as identified structural discrepancies between a company privacy policy and privacy laws or regulations.
Enhanced Adaptability: The established KGs for different tasks can be easily updated to adapt to new information, ensuring relevance and accuracy over time.
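The deterministic verification idea can be sketched as a comparison between a policy KG and a regulation KG. A simple set difference stands in here for the stricter first-order-logic checks described in refs. 6-8, and the triple vocabulary is entirely hypothetical.

```python
# Flag regulatory requirements that have no matching triple in the
# policy KG. A missing required triple is a potential violation.
def find_violations(policy_kg, regulation_kg):
    return regulation_kg - policy_kg

regulation_kg = {
    ("data_collection", "requires", "user_consent"),
    ("data_retention", "limited_to", "stated_purpose"),
}
policy_kg = {
    ("data_collection", "requires", "user_consent"),
}
print(find_violations(policy_kg, regulation_kg))
# → {('data_retention', 'limited_to', 'stated_purpose')}
```

Because the check is a deterministic graph operation, the result is reproducible and each flagged discrepancy points to a concrete missing fact, which supports the verification and reasoning accuracy claims above.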
2.3 Intelligent Chunking
Chunking strategies [ref. 9] have been shown to significantly impact retrieval performance, which in turn affects the performance of downstream applications. The naive chunking approach divides documents based solely on chunk size, making it context- and task-agnostic. In contrast, the context-aware chunking approach segments documents according to their structure and adds contextual information to each chunk, enhancing the completeness of each section. However, it remains task-agnostic, meaning it doesn't consider how the chunks will be used in downstream applications. Often, document utilization is predefined during drafting, allowing for chunking based on specific utilization purposes. In cases where the document structure is complex and multiple chunks from different sections are needed to complete a task, the relationships between these chunks become critical. To capture these relationships, KGs can be used to organize chunking results with task-specific information.
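A sketch of the task-aware chunk organization described above: chunks become nodes and task-specific relations become edges, so retrieving one chunk can also pull in the chunks it depends on. The chunk contents and the `depends_on` relation are hypothetical examples.

```python
# Organize chunking results as a small KG: nodes are chunks, edges capture
# task-specific relationships between chunks.
chunk_graph = {
    "nodes": {
        "c1": "Section 2: definitions of personal data",
        "c2": "Section 5: data deletion procedure",
    },
    "edges": [
        ("c2", "depends_on", "c1"),  # the procedure refers back to the definitions
    ],
}

def retrieve_with_dependencies(chunk_id, graph):
    # Return a chunk together with every chunk it depends on (one hop).
    needed = {chunk_id}
    needed |= {t for h, r, t in graph["edges"]
               if h == chunk_id and r == "depends_on"}
    return needed

print(sorted(retrieve_with_dependencies("c2", chunk_graph)))  # → ['c1', 'c2']
```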
3. Challenges
KG schemas can vary depending on the scenario and often require significant manual effort to design effective schemas or establish schema generation rules.
The KG-to-text and text-to-KG capabilities of LLMs remain underexplored in our research.
4. Takeaway
Knowledge Graphs (KGs) offer a structured and interpretable way to organize knowledge, making them highly useful for real-world applications across various domains. They enhance the reliability, accuracy, and transparency of knowledge representation, addressing some of the limitations of Large Language Models (LLMs), such as hallucinations and lack of domain-specific knowledge. Integrating KGs with LLMs can potentially enable improved performance in tasks like synthetic data generation, policy compliance verification, and intelligent chunking. However, challenges remain, including schema design complexities and the underexplored capabilities of KG-to-text and text-to-KG conversions in LLMs.
References
[1] Unifying Large Language Models and Knowledge Graphs: A Roadmap, https://arxiv.org/pdf/2306.08302
[2] Knowledge Graphs: Opportunities and Challenges, https://arxiv.org/pdf/2303.13948
[3] Knowledge Graphs, https://arxiv.org/pdf/2003.02320
[4] Magicoder: Empowering Code Generation with OSS-Instruct, https://arxiv.org/abs/2312.02120
[5] AgentInstruct: Toward Generative Teaching with Agentic Flows, https://arxiv.org/html/2407.03502v1
[6] PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs, https://www.usenix.org/conference/usenixsecurity23/presentation/cui
[7] A CASE STUDY FOR COMPLIANCE AS CODE WITH GRAPHS AND LANGUAGE MODELS: PUBLIC RELEASE OF THE REGULATORY KNOWLEDGE GRAPH, https://arxiv.org/abs/2302.01842
[8] Verifying compliance in process choreographies: Foundations, algorithms, and implementation, https://arxiv.org/abs/2110.09399
[9] Evaluating Chunking Strategies for Retrieval, https://research.trychroma.com/evaluating-chunking