Imagine setting up a factory production line where each station (node) performs a specific task, and you connect them to assemble widgets into a final product. This is knowledge pipeline orchestration—a visual workflow builder that allows you to configure data processing sequences through a drag-and-drop interface. It provides control over document ingestion, processing, chunking, indexing, and retrieval strategies.
In this section, you’ll learn about the knowledge pipeline process, understand the different nodes and how to configure them, and customize your own data processing workflows to efficiently manage and optimize your knowledge base.
Interface Status
When entering the knowledge pipeline orchestration canvas, you’ll see:
- Tab Status: The Documents, Retrieval Test, and Settings tabs are grayed out and unavailable for now
- Essential Steps: You must complete knowledge pipeline orchestration and publishing before uploading files
Your starting point depends on the template choice you made previously. If you chose Blank Knowledge Pipeline, you’ll see a canvas that contains only the Knowledge Base node, with a guide note next to it that walks you through the general steps of pipeline creation.
If you selected a specific pipeline template, the canvas shows a ready-to-use workflow that you can run or modify right away.
The Complete Knowledge Pipeline Process
Before we get started, let’s break down the knowledge pipeline process to understand how your documents are transformed into a searchable knowledge base.
The knowledge pipeline includes these key steps:
Data Source → Data Processing (Extractor + Chunker) → Knowledge Base Node (Chunk Structure + Retrieval Setting) → User Input Field → Test & Publish
- Data Source: Content from various data sources (local files, Notion, web pages, etc.)
- Data Processing: Process and transform data content
- Extractor: Parse and structure document content
- Chunker: Split structured content into manageable segments
- Knowledge Base: Set up chunk structure and retrieval settings
- User Input Field: Define parameters that pipeline users need to input for data processing
- Test & Publish: Validate and officially activate the knowledge base
Step 1: Data Source
In a knowledge base, you can choose single or multiple data sources. Currently, Dify supports 4 types of data sources: file upload, online drive, online documents, and web crawler.
Visit the Dify Marketplace for more data sources.
File Upload
Upload local files through drag-and-drop or file selection.
Configuration Options

| Item | Description |
|---|---|
| File Format | Supports PDF, XLSX, DOCX, and other formats. Users can customize their selection |
| Upload Method | Upload local files or folders through drag-and-drop or file selection. Batch upload is supported. |
Limitations

| Item | Description |
|---|---|
| File Quantity | Maximum 50 files per upload |
| File Size | Each file must not exceed 15MB |
| Storage | Limits on total document uploads and storage space may vary for different SaaS subscription plans |
Output Variables

| Output Variable | Format |
|---|---|
| {x} Document | Single document |
Online Document
Notion
Integrate with your Notion workspace to seamlessly import pages and databases, always keeping your knowledge base automatically updated.
Configuration Options

| Item | Option | Output Variable | Description |
|---|---|---|---|
| Extractor | Enabled | {x} Content | Structured and processed information |
| Extractor | Disabled | {x} Document | Original text |
Web Crawler
Transform web content into formats that can be easily read by large language models. The knowledge base supports Jina Reader and Firecrawl.
Jina Reader
An open-source web parsing tool providing simple and easy-to-use API services, suitable for fast crawling and processing web content.
Parameter Configuration

| Parameter | Type | Description |
|---|---|---|
| URL | Required | Target webpage address |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Use sitemap | Optional | Crawl by using website sitemap |
| Limit | Required | Set maximum number of pages to crawl |
| Enable Extractor | Optional | Choose data extraction method |
Firecrawl
An open-source web parsing tool that provides more refined crawling control options and API services. It supports deep crawling of complex website structures, recommended for batch processing and precise control.
Parameter Configuration

| Parameter | Type | Description |
|---|---|---|
| URL | Required | Target webpage address |
| Limit | Required | Set maximum number of pages to crawl |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Max depth | Optional | How many levels deep the crawler will traverse from the starting URL |
| Exclude paths | Optional | Specify URL patterns that should not be crawled |
| Include only paths | Optional | Crawl specified paths only |
| Extractor | Optional | Choose data processing method |
| Extract Only Main Content | Optional | Isolate and retrieve the primary, meaningful text and media from a webpage |
Online Drive
Connect your online cloud storage services (e.g., Google Drive, Dropbox, OneDrive) and let Dify automatically retrieve your files. Simply select and import the documents you need for processing, without manually downloading and re-uploading files.
Need help with authorization? Please check Authorize Data Source for detailed guidance on authorizing different data sources.
Step 2: Data Processing
In this stage, extractors and chunkers transform your content for optimal knowledge base storage and retrieval. Think of this step like meal preparation: we clean up the raw ingredients, chop them into bite-sized pieces, and organize everything so the dish can be cooked quickly when someone orders it.
Doc Processor
Documents come in different formats - PDF, XLSX, DOCX. However, LLMs can’t read these files directly. That’s where extractors come in. They support multiple formats and handle the conversion, so your content is ready for the next steps of the pipeline.
You can choose Dify’s Doc Extractor to process files, or select tools that fit your needs from the Marketplace, which offers the Dify Extractor and third-party tools such as Unstructured.
As an information processing center, the document extractor node identifies and reads files from input variables, extracts their information, and converts it into a format that works with the next node.
Dify Extractor
Dify Extractor is a built-in document parser provided by Dify. It supports multiple common file formats and is specially optimized for Doc files. It can extract and store images from documents and return their URLs.
Unstructured
Unstructured transforms documents into structured, machine-readable formats with highly customizable processing strategies. It offers multiple extraction strategies (auto, hi_res, fast, OCR-only) and chunking methods (by_title, by_page, by_similarity) to handle diverse document types, offering detailed element-level metadata including coordinates, confidence scores, and layout information. It’s recommended for enterprise document workflows, processing of mixed file types, and cases that require precise control over document processing parameters.
Chunker
Much like humans with a limited attention span, large language models cannot process huge amounts of information at once. Therefore, after information extraction, the chunker splits large document content into smaller, more manageable segments (called “chunks”).
Different documents require different chunking strategies. A product manual works best when split by product features, while research papers should be divided by logical sections. Dify offers 3 types of chunkers for various document types and use cases.
Overview of Different Chunkers
| Chunker Type | Highlights | Best for |
|---|---|---|
| General Chunker | Fixed-size chunks with customizable delimiters | Simple documents with basic structure |
| Parent-child Chunker | Dual-layer structure: precise matching + rich context | Complex documents requiring rich context preservation |
| Q&A Processor | Processes question-answer pairs from spreadsheets | Structured Q&A data from CSV/Excel files |
Common Text Pre-processing Rules
All chunkers support these text cleaning options:
| Preprocessing Option | Description |
|---|---|
| Replace consecutive spaces, newlines and tabs | Clean up formatting by replacing multiple whitespace characters with single spaces |
| Remove all URLs and email addresses | Automatically detect and remove web links and email addresses from text |
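To make these rules concrete, here is a minimal Python sketch of what the two options do conceptually; it is an illustration only, not Dify’s internal implementation:

```python
import re

def preprocess(text: str, clean_whitespace: bool = True, strip_urls_emails: bool = False) -> str:
    """Illustrative cleanup roughly mirroring the two pre-processing options above."""
    if strip_urls_emails:
        # Remove web links and email addresses.
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
        text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "", text)
    if clean_whitespace:
        # Collapse runs of spaces, tabs, and newlines into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("Contact   us at\tsupport@example.com\nhttps://example.com", strip_urls_emails=True))
# -> "Contact us at"
```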
General Chunker
Basic document chunking, suitable for documents with relatively simple structures. You can configure text chunking and text preprocessing rules using the options below.
Input and Output Variable
| Type | Variable | Description |
|---|---|---|
| Input Variable | {x} Content | Complete document content that the chunker will split into smaller segments |
| Output Variable | {x} Array[Chunk] | Array of chunked content, each segment optimized for retrieval and analysis |
Chunk Settings
| Configuration Item | Description |
|---|---|
| Delimiter | Default value is \n (line breaks for paragraph segmentation). You can customize chunking rules following regex. The system will automatically execute segmentation when the delimiter appears in text. |
| Maximum Chunk Length | Specifies the maximum character limit within a segment. When this length is exceeded, forced segmentation will occur. |
| Chunk Overlap | When segmenting data, there is some overlap between segments. This overlap helps improve information retention and analysis accuracy, enhancing recall effectiveness. |
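As an illustration of how these three settings interact, here is a simplified Python sketch of delimiter-based chunking with a maximum length and overlap. It is a conceptual sketch, not Dify’s chunker: for example, it does not force-split a single piece that already exceeds the maximum length, which the real chunker does.

```python
def chunk_text(text: str, delimiter: str = "\n", max_length: int = 500, overlap: int = 50) -> list[str]:
    """Split text at the delimiter, then pack pieces into chunks of up to max_length
    characters, carrying `overlap` trailing characters from one chunk into the next."""
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_length:
            chunks.append(current)
            # Start the next chunk with overlapping context from the previous one.
            current = current[-overlap:] if overlap else ""
        current = f"{current} {piece}".strip() if current else piece
    if current:
        chunks.append(current)
    return chunks
```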
Parent-child Chunker
By using a dual-layer segmentation structure to resolve the tension between context and accuracy, the parent-child chunker balances precise matching with comprehensive contextual information in Retrieval Augmented Generation (RAG) systems.
How Parent-child Chunker Works
Child Chunks for query matching: Small, precise information segments (usually single sentences) to match user queries with high accuracy.
Parent Chunks provide rich context: Larger content blocks (paragraphs, sections, or entire documents) that contain the matching child chunks, giving the large language model (LLM) comprehensive background information.
| Type | Variable | Description |
|---|---|---|
| Input Variable | {x} Content | Complete document content that the chunker will split into smaller segments |
| Output Variable | {x} Array[ParentChunk] | Array of parent chunks |
Chunk Settings
| Configuration Item | Description |
|---|---|
| Parent Delimiter | Set delimiter for parent chunk splitting |
| Parent Maximum Chunk Length | Set maximum character count for parent chunks |
| Child Delimiter | Set delimiter for child chunk splitting |
| Child Maximum Chunk Length | Set maximum character count for child chunks |
| Parent Mode | Choose between Paragraph (split text into paragraphs) or “Full Document” (use entire document as parent chunk) for direct retrieval |
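The following is a rough Python sketch of the dual-layer idea in Paragraph mode: parents are split on the parent delimiter, and each parent is further split into children. At query time a child chunk is matched against the query and its enclosing parent chunk is returned to the LLM. This is a conceptual illustration only, not Dify’s implementation.

```python
def parent_child_chunks(text: str, parent_delimiter: str = "\n\n", child_delimiter: str = "\n") -> list[dict]:
    """Split text into parent chunks (paragraphs), then split each parent into child chunks.
    Children are used for precise query matching; the parent supplies the surrounding context."""
    result = []
    for block in text.split(parent_delimiter):
        parent = block.strip()
        if not parent:
            continue
        children = [c.strip() for c in parent.split(child_delimiter) if c.strip()]
        result.append({"parent": parent, "children": children})
    return result
```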
Q&A Processor
Combining extraction and chunking in one node, Q&A Processor is specifically designed for structured Q&A datasets from CSV and Excel files. Perfect for FAQ lists, shift schedules, and any spreadsheet data with clear question-answer pairs.
Input and Output Variable
| Type | Variable | Description |
|---|---|---|
| Input Variable | {x} Document | A single file |
| Output Variable | {x} Array[QAChunk] | Array of Q&A chunks |
Variable Configuration
| Configuration Item | Description |
|---|---|
| Column Number for Question | Specify which column contains the questions |
| Column Number for Answer | Specify which column contains the answers |
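Conceptually, the Q&A Processor behaves like the Python sketch below, which reads a CSV file and turns each row into a question-answer chunk using the configured column numbers. It is illustrative only; the file path and column defaults are assumptions.

```python
import csv

def load_qa_chunks(path: str, question_col: int = 0, answer_col: int = 1) -> list[dict]:
    """Turn each spreadsheet row into one question-answer chunk, using the
    configured column numbers for the question and the answer."""
    chunks = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) > max(question_col, answer_col):
                chunks.append({"question": row[question_col].strip(),
                               "answer": row[answer_col].strip()})
    return chunks
```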
Step 3: Knowledge Base
Now that your documents are processed and chunked, it’s time to set up how they’ll be stored and retrieved. Here, you can select different indexing methods and retrieval strategies based on your specific needs.
Knowledge base node configuration includes: Input Variable, Chunk Structure, Index Method, and Retrieval Settings.
Chunk Structure
Chunk structure determines how the knowledge base organizes and indexes your document content. Choose the structure mode that best fits your document type, use case, and cost.
The knowledge base supports three chunk modes: General Mode, Parent-child Mode, and Q&A Mode. If you’re creating a knowledge base for the first time, we recommend choosing Parent-child Mode.
Important Reminder: Chunk structure cannot be modified once saved and published. Please choose carefully.
General Mode
Suitable for most standard document processing scenarios. It provides flexible indexing options—you can choose appropriate indexing methods based on different quality and cost requirements.
General mode supports both high-quality and economical indexing methods, as well as various retrieval settings.
Parent-child Mode
It provides precise matching and corresponding contextual information during retrieval, suitable for professional documents that need to maintain complete context.
Parent-child mode supports HQ (High Quality) mode only, offering child chunks for query matching and parent chunks for contextual information during retrieval.
Q&A Mode
Create documents that pair questions with answers when using structured question-answer data. These documents are indexed based on the question portion, enabling the system to retrieve relevant answers based on query similarity.
Q&A Mode supports HQ (High Quality) mode only.
Input Variable
Input variables receive processing results from data processing nodes and serve as the data source for the knowledge base. You need to connect the chunker’s output to the knowledge base node as its input.
The node supports different types of standard inputs based on the selected chunk structure:
- General Mode: {x} Array[Chunk] - General chunk array
- Parent-child Mode: {x} Array[ParentChunk] - Parent chunk array
- Q&A Mode: {x} Array[QAChunk] - Q&A chunk array
Index Method & Retrieval Setting
The index method determines how your knowledge base builds content indexes, while retrieval settings provide corresponding retrieval strategies based on the selected index method. Think of it in this way: index method determines how to organize your documents, while retrieval settings tell users what methods they can use to find documents.
The knowledge base provides two index methods: High Quality and Economy, each offering different retrieval setting options.
High Quality mode uses embedding models to convert segmented text blocks into numerical vectors, helping to compress and store large amounts of text information more effectively. This enables the system to find semantically relevant, accurate answers even when the wording of the user’s question doesn’t exactly match the document.
In Economy mode, each chunk uses 10 keywords for retrieval without calling embedding models, so it incurs no extra cost.
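The difference between the two index methods can be illustrated with a small Python sketch: High Quality compares embedding vectors, while Economy scores chunks by keyword overlap. This is a conceptual illustration, not Dify’s actual scoring code.

```python
import math
from collections import Counter

def cosine_similarity(query_vec: list[float], chunk_vec: list[float]) -> float:
    """High Quality: compare the query and a chunk as embedding vectors."""
    dot = sum(q * c for q, c in zip(query_vec, chunk_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(c * c for c in chunk_vec))
    return dot / norm if norm else 0.0

def keyword_score(query: str, chunk_keywords: list[str]) -> int:
    """Economy: score a chunk by how many of its stored keywords appear in the query."""
    query_terms = Counter(query.lower().split())
    return sum(query_terms[k.lower()] for k in chunk_keywords)
```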
Index Methods and Retrieval Settings
| Index Method | Available Retrieval Settings | Description |
|---|---|---|
| High Quality | Vector Retrieval | Understands the deeper meaning of queries based on semantic similarity |
| High Quality | Full-text Retrieval | Keyword-based retrieval providing comprehensive search capabilities |
| High Quality | Hybrid Retrieval | Combines both semantic and keyword matching |
| Economy | Inverted Index | Common search engine retrieval method that matches queries with key content |
You can also refer to the table below for information on configuring chunk structure, indexing methods, parameters, and retrieval settings.
| Chunk Structure | Index Methods | Parameters | Retrieval Settings |
|---|---|---|---|
| General Mode | High Quality, Economy | Embedding Model, Number of Keywords | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval, Inverted Index |
| Parent-child Mode | High Quality only | Embedding Model | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval |
| Q&A Mode | High Quality only | Embedding Model | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval |
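To give an intuition for Hybrid Retrieval, the sketch below blends a semantic score and a keyword score with a configurable weight. The actual weighting and any reranking step are configured in the retrieval settings; the numbers here are made up.

```python
def hybrid_score(vector_score: float, keyword_score: float, semantic_weight: float = 0.7) -> float:
    """Blend semantic (vector) and keyword (full-text) relevance into a single ranking score.
    Both input scores are assumed to be normalized to the range [0, 1]."""
    return semantic_weight * vector_score + (1 - semantic_weight) * keyword_score

# Hypothetical scores for two chunks: chunk A wins on semantics, chunk B on keywords.
candidates = [{"chunk": "A", "vec": 0.82, "kw": 0.10},
              {"chunk": "B", "vec": 0.55, "kw": 0.95}]
ranked = sorted(candidates, key=lambda c: hybrid_score(c["vec"], c["kw"]), reverse=True)
print([c["chunk"] for c in ranked])  # -> ['B', 'A'] with the default weight of 0.7
```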
Step 4: User Input Field
User input forms are essential for collecting the initial information your pipeline needs to run effectively. Similar to the start node in a workflow, this form gathers necessary details from users - such as files to upload or specific parameters for document processing - ensuring your pipeline has all the information it needs to deliver accurate results.
This way, you can create specialized input forms for different use scenarios, improving pipeline flexibility and usability for various data sources or document processing steps.
There are two ways to create user input fields:
- Pipeline Orchestration Interface
Click the Input Field to start creating and configuring input forms.
- Node Parameter Panel
Select a node, then in the parameter input section of the right-side panel, click + Create user input to add new input items. New input items will also be collected in the Input Field.
Data Source-specific Inputs
These inputs are specific to each data source and its downstream nodes. Users only need to fill out these fields when selecting the corresponding data source, such as different URLs for different data sources.
How to create: Click the + button on the right side of a data source to add fields for that specific data source. These fields can only be referenced by that data source and its subsequently connected nodes.
Global Inputs
Global shared inputs can be referenced by all nodes. They are suitable for universal processing parameters, such as delimiters, maximum chunk length, and document processing configurations. Users need to fill out these fields regardless of which data source they choose.
How to create: Click the + button on the right side of Global Inputs to add fields that can be referenced by any node.
The knowledge pipeline supports seven types of input variables:
| Field Type | Description |
|---|---|
| Text | Short text input by knowledge base users, maximum length 256 characters |
| Paragraph | Long text input for longer character strings |
| Select | Fixed options preset by the orchestrator for users to choose from, users cannot add custom content |
| Boolean | Only true/false values |
| Number | Only accepts numerical input |
| Single | Upload a single file, supports multiple file types (documents, images, audio, and other file types) |
| File List | Batch file upload, supports multiple file types (documents, images, audio, and other file types) |
Field Configuration Options
All input field types include required, type-specific, and additional settings. You can set whether a field is required by checking the appropriate option.
| Setting | Name | Description | Example |
|---|---|---|---|
| Required Settings | Variable Name | Internal system identifier, usually named using English and underscores | user_email |
| Required Settings | Display Name | Interface display name, usually concise and readable text | User Email |
| Type-specific Settings | - | Special requirements for different field types | Text field max length 100 characters |
| Additional Settings | Default Value | Default value used when the user hasn’t provided input | Number field defaults to 0, text field defaults to empty |
| Additional Settings | Placeholder | Hint text displayed when the input box is empty | “Please enter your email” |
| Additional Settings | Tooltip | Explanatory text to guide user input, usually displayed on mouse hover | “Please enter a valid email address” |
| Special Optional Settings | - | Additional setting options based on different field types | Validation of email format |
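Putting these settings together, a single field definition might conceptually look like the hypothetical structure below. This is not Dify’s actual DSL schema; the key names are illustrative only.

```python
# A hypothetical field definition showing how the settings above fit together.
# This is NOT Dify's actual DSL schema; the key names are illustrative only.
user_email_field = {
    "variable_name": "user_email",   # Required: internal identifier (English + underscores)
    "display_name": "User Email",    # Required: label shown on the form
    "type": "text",
    "max_length": 100,               # Type-specific: character limit for a Text field
    "required": True,
    "default": "",                   # Additional: value used when the user provides no input
    "placeholder": "Please enter your email",
    "tooltip": "Please enter a valid email address",
}
```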
After completing the configuration, click the preview button in the upper right corner to view the form preview. You can drag fields to adjust their grouping. If an exclamation mark appears, it indicates that a reference became invalid after the move.
Step 5: Name the Knowledge Base
By default, the knowledge base name will be “Untitled + number”, permissions are set to “Only me”, and the icon will be an orange book. If you import it from a DSL file, it will use the saved icon.
Edit the knowledge base information by clicking Settings in the left panel and filling in the information below:
- Name & Icon
Pick a name for your knowledge base.
Choose an emoji, upload an image, or paste an image URL as the icon of this knowledge base.
- Knowledge Description
Provide a brief description of your knowledge base. This helps the AI better understand and retrieve your data. If left empty, Dify will apply the default retrieval strategy.
- Permissions
Select the appropriate access permissions from the dropdown menu.
Step 6: Testing
You’re almost there! This is the final step of the knowledge pipeline orchestration.
After completing the orchestration, you need to validate the configuration first, then run tests to confirm all the settings, and finally publish the knowledge pipeline.
Configuration Completeness Check
Before testing, it’s recommended to check the completeness of your configuration to avoid test failures due to missing configurations.
Click the checklist button in the upper right corner, and the system will display any missing parts.
After completing all configurations, you can preview the knowledge base pipeline’s operation through test runs, confirm that all settings are accurate, and then proceed with publishing.
Test Run
- Start Test: Click the “Test Run” button in the upper right corner
- Import Test File: Import files in the data source window that pops up on the right
Important Note: For better debugging and observation, only one file upload is allowed per test run.
- Fill Parameters: After successful import, fill in corresponding parameters according to the user input form you configured earlier
- Start Test Run: Click next step to start testing the entire pipeline
During testing, you can access History Logs (track all run records with timestamps, execution status, and input/output summaries) and Variable Inspector (a dashboard at the bottom showing input/output data for each node to help identify issues and verify data flow) for efficient troubleshooting and error fixing.
