Create a new knowledge base

Step 1. Creating a new knowledge base

Click on Knowledge in the main navigation bar of Dify. On this page, you can see your existing knowledge bases. Click Create Knowledge to enter the setup wizard:

Drag and drop or select files to upload. The number of files allowed per batch upload depends on your subscription plan.
If you have not prepared any documents yet, you can create an empty knowledge base first.
When you create a knowledge base from an external data source (such as Notion or Sync from website), the knowledge base type becomes immutable. This restriction prevents the management complexity that multiple data sources within a single knowledge base would cause.

For scenarios requiring multiple data sources, we recommend creating separate knowledge bases for each source. You can then utilize the Multiple-Retrieval feature to reference multiple knowledge bases within the same application.

Limitations for uploading documents:

The upload size limit for a single document is 15MB.
For the SaaS version, the subscription plan limits the number of files per batch upload, the total number of uploaded documents, and the vector storage.
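
The size limit above can be checked locally before uploading. The following is a minimal sketch, not part of Dify: the function name `split_by_size_limit` is hypothetical, and treating 15MB as 15 * 1024 * 1024 bytes is an assumption.

```python
import os

# 15MB limit taken from the text; interpreting MB as 1024 * 1024 bytes is an assumption.
MAX_UPLOAD_BYTES = 15 * 1024 * 1024


def split_by_size_limit(paths, max_bytes=MAX_UPLOAD_BYTES):
    """Partition file paths into (uploadable, too_large) by their size on disk."""
    uploadable, too_large = [], []
    for path in paths:
        size = os.path.getsize(path)
        if size <= max_bytes:
            uploadable.append(path)
        else:
            too_large.append(path)
    return uploadable, too_large
```

Files in the second list would need to be split or compressed before they can be added to the knowledge base.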

Step 2. Choose Chunking Mode

Two strategies are supported:

Automatic mode

Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Dify automatically segments and cleans the uploaded files, streamlining the document preparation process.

Custom mode

Custom mode is tailored for advanced users with specific text processing requirements. This mode allows manual configuration of chunking rules and cleaning strategies based on different document formats and scenario demands.

Chunking Rules:
1. Delimiter: Specify a delimiter for text segmentation. For example, \n (the newline character in regex syntax) chunks text at each line break.
2. Maximum chunk length: Set the maximum length of each chunk. Chunks exceeding this limit are forcibly split. The maximum length for a chunk is 4000 tokens.
3. Chunk overlap: Define the overlap between adjacent chunks. Overlap preserves context across chunk boundaries, which improves recall. The recommended setting is 10-25% of the chunk length in tokens.
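
The three chunking rules can be illustrated with a short sketch. This is not Dify's implementation: `chunk_text` is a hypothetical helper, lengths are counted in characters as a stand-in for tokens, and overlap is applied only when a piece has to be force-split.

```python
def chunk_text(text, delimiter="\n", max_len=500, overlap=50):
    """Split on the delimiter, then force-split long pieces into overlapping windows.

    Lengths are measured in characters here (an assumption); a real system
    would count tokens with the embedding model's tokenizer.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap  # how far each window advances
    chunks = []
    for piece in text.split(delimiter):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces between consecutive delimiters
        if len(piece) <= max_len:
            chunks.append(piece)
            continue
        # Piece is too long: slide a window of max_len, advancing by step.
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break  # the last window already reached the end
    return chunks
```

For example, `chunk_text("abcdefghij\nxy", max_len=4, overlap=1)` yields `["abcd", "defg", "ghij", "xy"]`: each forced split repeats the last character of the previous chunk.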
Text Preprocessing Rules: These rules help filter out insignificant content from the knowledge base.
- Replace consecutive spaces, newlines, and tabs.
- Delete all URLs and email addresses.
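
The two preprocessing rules can be sketched with regular expressions. This is an illustration under assumptions, not Dify's code: `clean_text` and both patterns are hypothetical, the patterns are deliberately simple, and collapsing each run of whitespace to a single character is one possible reading of the first rule.

```python
import re

# Simplified patterns (assumptions); real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(text, collapse_whitespace=True, remove_urls_emails=True):
    """Apply the two preprocessing rules; each can be switched off independently."""
    if remove_urls_emails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if collapse_whitespace:
        text = re.sub(r"[ \t]+", " ", text)   # runs of spaces/tabs -> one space
        text = re.sub(r"\n{2,}", "\n", text)  # runs of newlines -> one newline
    return text.strip()
```

URLs and email addresses are removed first so that the gaps they leave behind are collapsed by the whitespace rule.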

Step 3. Choose Indexing Mode

Dify provides three indexing modes:

High-Quality Mode: This mode uses an embedding model to vectorize the text and relies on approximate matching in a vector database for subsequent searches. Embedding consumes tokens.
Economy Mode: This mode builds an inverted index using traditional keyword search methods, with components similar to Elasticsearch (ES). It is less accurate but does not consume tokens. The inverted index returns only the Top_K results. Users who are interested can test this mode themselves.
Q&A Mode (Community Edition only): After the document is segmented, this mode generates Q&A pairs for each segment through summarization. When a user asks a question, the system finds the most similar question and returns the corresponding segment as the answer. This mode is more precise because it matches the user's question directly against the generated questions, retrieving the information the user actually needs.
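
To make the Economy Mode description concrete, here is a toy inverted index that returns only the top_k segments. It is a sketch of the general technique, not Dify's implementation: the function names, the tokenizer, and ranking by the number of matched query terms are all assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(segments):
    """Map each token to the set of segment ids that contain it."""
    index = defaultdict(set)
    for seg_id, text in enumerate(segments):
        for token in tokenize(text):
            index[token].add(seg_id)
    return index


def keyword_search(index, query, top_k=3):
    """Rank segments by how many distinct query tokens they contain; return top_k ids."""
    hits = Counter()
    for token in set(tokenize(query)):
        for seg_id in index.get(token, ()):
            hits[seg_id] += 1
    # Sort by match count (descending), then by segment id for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [seg_id for seg_id, _ in ranked[:top_k]]
```

No embedding model is called at any point, which is why this style of index does not consume tokens; the trade-off is that "base" and "bases" count as different terms, so matching is less accurate.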

Index Model

High-Quality Indexing supports three types of search settings:

Vector Search
Full-Text Search
Hybrid Search, which combines Vector Search and Full-Text Search.
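
One common way to combine the two result sets in Hybrid Search is a weighted sum of normalized scores. The sketch below shows that idea only; the helper names, the min-max normalization, and the default `vector_weight` are assumptions, and Dify's actual merging logic may differ.

```python
def min_max_normalize(scores):
    """Rescale a {segment_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, vector_weight=0.7, top_k=3):
    """Merge vector and full-text scores with a weighted sum; return top_k (id, score)."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {}
    for seg_id in set(v) | set(k):
        # A segment missing from one result set contributes 0 for that part.
        combined[seg_id] = (
            vector_weight * v.get(seg_id, 0.0)
            + (1.0 - vector_weight) * k.get(seg_id, 0.0)
        )
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return ranked[:top_k]
```

Normalizing first matters because vector similarities and keyword scores usually live on different scales; without it, one of the two searches would dominate the sum.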

⚠️Note:
    The Rerank model can significantly improve the accuracy of RAG (Retrieval-Augmented Generation) recall. 
    If a Rerank-supported model is configured on the "Model Provider" page, enabling the "Rerank Model" in 
    the search settings will allow the system to perform semantic reordering of the retrieved document 
    results after the initial semantic search, optimizing the ranking results. After setting the Rerank 
    model, the TopK and Score threshold settings will only take effect during the Rerank step.
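
The note above says that TopK and the score threshold apply at the Rerank step. A minimal sketch of that ordering follows; `rerank` and `overlap_scorer` are hypothetical, and the word-overlap scorer merely stands in for a real Rerank model.

```python
def overlap_scorer(query):
    """Return a toy relevance function: the fraction of query words found in a segment."""
    query_words = set(query.lower().split())

    def score(segment):
        if not query_words:
            return 0.0
        segment_words = set(segment.lower().split())
        return len(query_words & segment_words) / len(query_words)

    return score


def rerank(candidates, score_fn, top_k=3, score_threshold=0.0):
    """Score every candidate, drop those below the threshold, keep the best top_k."""
    scored = [(segment, score_fn(segment)) for segment in candidates]
    kept = [(segment, s) for segment, s in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

In this arrangement the initial search only supplies candidates; the final cut-off is decided by the rerank scores.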

Retrieval Settings

Wait for the embedding process to complete, then click "Go to Document" to view the vectorized document.

Step 4. Document Maintenance (Optional)

After document processing is complete, the documents usually need maintenance: viewing text segments, checking segment quality, adding and editing text segments, and managing metadata. Metadata serves as structured fields during segment recall, where it can filter results or display reference sources. This step is optional in this experiment; for details, refer to the following documentation:

How to Manage Documents

After the knowledge base is built, we can proceed to create a chat assistant application and have the chat assistant respond based on the content of the knowledge base.