Imagine setting up a factory production line where each station (node) performs a specific task, and you connect the stations to assemble widgets into a final product. This is knowledge pipeline orchestration: a visual workflow builder that lets you configure data processing sequences through a drag-and-drop interface, giving you control over document ingestion, processing, chunking, indexing, and retrieval strategies. In this section, you’ll learn about the knowledge pipeline process, the different nodes and how to configure them, and how to customize your own data processing workflows to efficiently manage and optimize your knowledge base.

Interface Status

When entering the knowledge pipeline orchestration canvas, you’ll see:
  • Tab Status: Documents, Retrieval Test, and Settings tabs will be grayed out and unavailable at the moment
  • Essential Steps: You must complete knowledge pipeline orchestration and publishing before uploading files
Your starting point depends on the template choice you made previously. If you chose Blank Knowledge Pipeline, you’ll see a canvas that contains only the Knowledge Base node, along with a note next to it that walks you through the general steps of pipeline creation. If you selected a specific pipeline template, a ready-to-use workflow will already be on the canvas for you to use or modify right away.

The Complete Knowledge Pipeline Process

Before we get started, let’s break down the knowledge pipeline process to understand how your documents are transformed into a searchable knowledge base. The knowledge pipeline includes these key steps:
Data Source → Data Processing (Extractor + Chunker) → Knowledge Base Node (Chunk Structure + Retrieval Setting) → User Input Field → Test & Publish
  1. Data Source: Content from various data sources (local files, Notion, web pages, etc.)
  2. Data Processing: Process and transform data content
    • Extractor: Parse and structure document content
    • Chunker: Split structured content into manageable segments
  3. Knowledge Base: Set up chunk structure and retrieval settings
  4. User Input Field: Define parameters that pipeline users need to input for data processing
  5. Test & Publish: Validate and officially activate the knowledge base
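To make the data flow concrete, here is a minimal, hypothetical Python sketch of the same sequence. The helper functions are illustrative stand-ins for the pipeline nodes, not Dify APIs.

```python
# A minimal, hypothetical sketch of the pipeline stages; the helpers below are
# illustrative stand-ins for the Dify nodes, not actual Dify APIs.

def extract(raw: bytes) -> str:
    """Step 2a (Extractor): parse a raw file into plain text."""
    return raw.decode("utf-8", errors="ignore")

def chunk(text: str, max_len: int = 200) -> list[str]:
    """Step 2b (Chunker): split text into manageable segments."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def index(chunks: list[str]) -> dict[str, list[str]]:
    """Step 3 (Knowledge Base node): build a toy keyword index."""
    kb: dict[str, list[str]] = {}
    for c in chunks:
        for word in set(c.lower().split()):
            kb.setdefault(word, []).append(c)
    return kb

# Step 1 (Data Source) provides the raw content; Step 4 (User Input Field)
# would supply parameters such as max_len; Step 5 is testing retrieval, then publishing.
raw_doc = b"Dify pipelines turn source documents into a searchable knowledge base."
kb = index(chunk(extract(raw_doc), max_len=40))
print(kb.get("searchable", []))
```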

Step 1: Data Source

In a knowledge base, you can choose single or multiple data sources. Currently, Dify supports 4 types of data sources: file upload, online drive, online documents, and web crawler. Visit the Dify Marketplace for more data sources.

File Upload

Upload local files through drag-and-drop or file selection.
Configuration Options

| Item | Description |
| --- | --- |
| File Format | Supports PDF, XLSX, DOCX, etc. Users can customize their selection. |
| Upload Method | Upload local files or folders through drag-and-drop or file selection. Batch upload is supported. |

Limitations

| Item | Description |
| --- | --- |
| File Quantity | Maximum 50 files per upload |
| File Size | Each file must not exceed 15 MB |
| Storage | Limits on total document uploads and storage space may vary across SaaS subscription plans |

Output Variables

| Output Variable | Format |
| --- | --- |
| {x} Document | Single document |

Online Document

Notion

Integrate with your Notion workspace to seamlessly import pages and databases, always keeping your knowledge base automatically updated.
Configuration Options

| Item | Option | Output Variable | Description |
| --- | --- | --- | --- |
| Extractor | Enabled | {x} Content | Structured and processed information |
| Extractor | Disabled | {x} Document | Original text |

Web Crawler

Transform web content into formats that can be easily read by large language models. The knowledge base supports Jina Reader and Firecrawl.

Jina Reader

An open-source web parsing tool with a simple, easy-to-use API, suitable for quickly crawling and processing web content.
Parameter Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| URL | Required | Target webpage address |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Use sitemap | Optional | Crawl using the website sitemap |
| Limit | Required | Maximum number of pages to crawl |
| Enable Extractor | Optional | Choose the data extraction method |

Firecrawl

An open-source web parsing tool that provides more refined crawling control options and API services. It supports deep crawling of complex website structures and is recommended for batch processing and precise control.
Parameter Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| URL | Required | Target webpage address |
| Limit | Required | Maximum number of pages to crawl |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Max depth | Optional | How many levels deep the crawler traverses from the starting URL |
| Exclude paths | Optional | URL patterns that should not be crawled |
| Include only paths | Optional | Crawl the specified paths only |
| Extractor | Optional | Choose the data processing method |
| Extract Only Main Content | Optional | Isolate and retrieve the primary, meaningful text and media from a webpage |

Online Drive

Connect your online cloud storage services (e.g., Google Drive, Dropbox, OneDrive) and let Dify automatically retrieve your files. Simply select and import the documents you need for processing, without manually downloading and re-uploading files.
Need help with authorization? Please check Authorize Data Source for detailed guidance on authorizing different data sources.

Step 2: Set Up Data Processing Tools

In this stage, extraction and chunking tools transform the content for optimal knowledge base storage and retrieval. Think of this step like meal preparation: we clean the raw ingredients, chop them into bite-sized pieces, and organize everything so the dish can be cooked quickly when someone orders it.

Doc Processor

Documents come in different formats, such as PDF, XLSX, and DOCX, but LLMs can’t read these files directly. That’s where extractors come in: they support multiple formats and handle the conversion so your content is ready for the next step of the pipeline. You can use Dify’s Doc Extractor to process files, or pick tools from the Marketplace, which offers the Dify Extractor and third-party tools such as Unstructured.

Doc Extractor

As an information processing center, the Document Extractor node identifies and reads files from input variables, extracts their information, and converts it into a format that works with the next node.
For more information, please refer to the Document Extractor.
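For intuition, the sketch below shows the kind of work an extractor performs: detecting the file format and converting it to plain text. It is a simplified illustration, not the Doc Extractor’s actual implementation, and it assumes the third-party packages pypdf and python-docx are installed.

```python
# Illustrative sketch of format-aware text extraction (not Dify's implementation).
# Assumes `pip install pypdf python-docx` has been run.
from pathlib import Path

from docx import Document   # python-docx
from pypdf import PdfReader

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    # Fall back to plain-text reading for other formats.
    return Path(path).read_text(encoding="utf-8", errors="ignore")

print(extract_text("product_manual.pdf")[:200])  # hypothetical file path
```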

Dify Extractor

Dify Extractor is a built-in document parser provided by Dify. It supports multiple common file formats and is specially optimized for Doc files. It can extract and store images from documents and return image URLs.

Unstructured

Unstructured transforms documents into structured, machine-readable formats with highly customizable processing strategies. It offers multiple extraction strategies (auto, hi_res, fast, OCR-only) and chunking methods (by_title, by_page, by_similarity) to handle diverse document types, offering detailed element-level metadata including coordinates, confidence scores, and layout information. It’s recommended for enterprise document workflows, processing of mixed file types, and cases that require precise control over document processing parameters.
Explore more tools in the Dify Marketplace.

Chunker

Much like humans with a limited attention span, large language models cannot process huge amounts of information at once. Therefore, after information extraction, the chunker splits large document content into smaller, manageable segments (called “chunks”). Different documents require different chunking strategies: a product manual works best when split by product features, while research papers should be divided by logical sections. Dify offers three types of chunkers for various document types and use cases.

Overview of Different Chunkers

| Chunker Type | Highlights | Best for |
| --- | --- | --- |
| General Chunker | Fixed-size chunks with customizable delimiters | Simple documents with basic structure |
| Parent-child Chunker | Dual-layer structure: precise matching + rich context | Complex documents requiring rich context preservation |
| Q&A Processor | Processes question-answer pairs from spreadsheets | Structured Q&A data from CSV/Excel files |

Common Text Pre-processing Rules

All chunkers support these text cleaning options:
| Preprocessing Option | Description |
| --- | --- |
| Replace consecutive spaces, newlines and tabs | Clean up formatting by replacing multiple whitespace characters with single spaces |
| Remove all URLs and email addresses | Automatically detect and remove web links and email addresses from text |
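As a rough mental model, both rules can be expressed as simple regular expression substitutions. The patterns below are simplified for illustration and are not the exact rules Dify applies.

```python
# Simplified sketch of the two pre-processing rules (not Dify's exact patterns).
import re

def preprocess(text: str) -> str:
    # Remove URLs and email addresses.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "", text)
    # Replace consecutive spaces, newlines and tabs with a single space.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Contact  us\tat support@example.com or https://example.com\ntoday"))
# -> "Contact us at or today"
```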

General Chunker

Basic document chunking, suitable for documents with relatively simple structures. You can configure text chunking and text preprocessing rules as described below.

Input and Output Variable
| Type | Variable | Description |
| --- | --- | --- |
| Input Variable | {x} Content | Complete document content that the chunker will split into smaller segments |
| Output Variable | {x} Array[Chunk] | Array of chunked content, each segment optimized for retrieval and analysis |
Chunk Settings
| Configuration Item | Description |
| --- | --- |
| Delimiter | Default value is \n (line break for paragraph segmentation). You can customize chunking rules using regex; the system automatically segments the text wherever the delimiter appears. |
| Maximum Chunk Length | The maximum character limit within a segment. When this length is exceeded, forced segmentation occurs. |
| Chunk Overlap | The amount of overlap between adjacent segments. Overlap helps improve information retention and analysis accuracy, enhancing recall effectiveness. |
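To see how the three settings interact, here is a simplified sketch that splits on the delimiter first and then enforces the maximum length with a sliding overlap. It is an approximation for intuition, not the General Chunker’s actual algorithm.

```python
# Simplified illustration of delimiter + maximum length + overlap
# (an approximation, not the General Chunker's actual algorithm).
def general_chunk(text: str, delimiter: str = "\n",
                  max_len: int = 500, overlap: int = 50) -> list[str]:
    chunks: list[str] = []
    for segment in text.split(delimiter):          # 1. split on the delimiter
        start = 0
        while start < len(segment):                # 2. force-split anything over max_len
            chunks.append(segment[start:start + max_len])
            if start + max_len >= len(segment):
                break
            start += max_len - overlap             # 3. step back by `overlap` characters
    return [c for c in chunks if c.strip()]

text = "First paragraph about setup.\nSecond paragraph about retrieval settings."
print(general_chunk(text, max_len=30, overlap=5))
```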

Parent-child Chunker

By using a dual-layer segmentation structure, the parent-child chunker resolves the tension between context and accuracy, balancing precise matching with comprehensive contextual information in Retrieval-Augmented Generation (RAG) systems.

How Parent-child Chunker Works
  • Child chunks for query matching: small, precise information segments (usually single sentences) that match user queries with high accuracy.
  • Parent chunks for rich context: larger content blocks (paragraphs, sections, or entire documents) that contain the matching child chunks, giving the large language model (LLM) comprehensive background information.
| Type | Variable | Description |
| --- | --- | --- |
| Input Variable | {x} Content | Complete document content that the chunker will split into smaller segments |
| Output Variable | {x} Array[ParentChunk] | Array of parent chunks |
Chunk Settings
| Configuration Item | Description |
| --- | --- |
| Parent Delimiter | Set the delimiter for parent chunk splitting |
| Parent Maximum Chunk Length | Set the maximum character count for parent chunks |
| Child Delimiter | Set the delimiter for child chunk splitting |
| Child Maximum Chunk Length | Set the maximum character count for child chunks |
| Parent Mode | Choose between Paragraph (split text into paragraphs) or Full Document (use the entire document as the parent chunk) for direct retrieval |
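The sketch below is a toy illustration of the dual-layer idea, assuming paragraph-level parents and sentence-level children; the real chunker additionally applies the length limits configured above.

```python
# Toy illustration of dual-layer (parent-child) chunking; the real chunker also
# enforces the maximum chunk lengths configured above.
def parent_child_chunk(text: str,
                       parent_delim: str = "\n\n",   # paragraph-level parents
                       child_delim: str = ". ") -> list[dict]:
    parents = []
    for p_id, parent in enumerate(p for p in text.split(parent_delim) if p.strip()):
        children = [c.strip() for c in parent.split(child_delim) if c.strip()]
        parents.append({"parent_id": p_id, "parent": parent.strip(), "children": children})
    return parents

doc = ("Dify supports several chunk structures. Parent-child mode preserves context.\n\n"
       "Child chunks match user queries. Parent chunks supply background for the LLM.")
for p in parent_child_chunk(doc):
    print(p["parent_id"], p["children"])

# At query time a child chunk is matched first; its parent chunk is then passed
# to the LLM so the answer has full context.
```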

Q&A Processor

Combining extraction and chunking in one node, the Q&A Processor is specifically designed for structured Q&A datasets from CSV and Excel files. It is perfect for FAQ lists, shift schedules, and any spreadsheet data with clear question-answer pairs.

Input and Output Variable
| Type | Variable | Description |
| --- | --- | --- |
| Input Variable | {x} Document | A single file |
| Output Variable | {x} Array[QAChunk] | Array of Q&A chunks |
Variable Configuration

| Configuration Item | Description |
| --- | --- |
| Column Number for Question | Set which column contains the questions |
| Column Number for Answer | Set which column contains the answers |
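For intuition, here is a minimal sketch of how a spreadsheet with designated question and answer columns can be turned into Q&A chunks. It uses CSV and 1-based column numbers purely for illustration; it is not the Q&A Processor’s actual implementation.

```python
# Minimal sketch of turning a spreadsheet into Q&A chunks (CSV for brevity;
# column numbers are 1-based). Not the Q&A Processor's actual implementation.
import csv
import io

faq_csv = io.StringIO(
    "id,question,answer\n"
    "1,How do I reset my password?,Use the account settings page.\n"
    '2,Which file formats are supported?,"PDF, XLSX, DOCX and more."\n'
)

def qa_chunks(csv_file, question_col: int, answer_col: int) -> list[dict]:
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    return [{"question": row[question_col - 1], "answer": row[answer_col - 1]}
            for row in reader if row]

# Column 2 holds the question, column 3 the answer.
print(qa_chunks(faq_csv, question_col=2, answer_col=3))
```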

Step 3: Configure Knowledge Base Node

Now that your documents are processed and chunked, it’s time to set up how they’ll be stored and retrieved. Here, you can select different indexing methods and retrieval strategies based on your specific needs. Knowledge base node configuration includes: Input Variable, Chunk Structure, Index Method, and Retrieval Settings.

Chunk Structure

Chunk structure determines how the knowledge base organizes and indexes your document content. Choose the structure mode that best fits your document type, use case, and cost. The knowledge base supports three chunk modes: General Mode, Parent-child Mode, and Q&A Mode. If you’re creating a knowledge base for the first time, we recommend choosing Parent-child Mode.
Important Reminder: Chunk structure cannot be modified once saved and published. Please choose carefully.

General Mode

Suitable for most standard document processing scenarios. It provides flexible indexing options—you can choose appropriate indexing methods based on different quality and cost requirements. General mode supports both high-quality and economical indexing methods, as well as various retrieval settings.

Parent-child Mode

It provides precise matching and corresponding contextual information during retrieval, suitable for professional documents that need to maintain complete context. Parent-child mode supports HQ (High Quality) mode only, offering child chunks for query matching and parent chunks for contextual information during retrieval.

Q&A Mode

Create documents that pair questions with answers when using structured question-answer data. These documents are indexed based on the question portion, enabling the system to retrieve relevant answers based on query similarity. Q&A Mode supports HQ (High Quality) mode only.

Input Variable

Input variables receive processing results from data processing nodes as the data source for the knowledge base. You need to connect the output of the chunker to the knowledge base as input. The node accepts different standard input types depending on the selected chunk structure:
  • General Mode: {x} Array[Chunk] - general chunk array
  • Parent-child Mode: {x} Array[ParentChunk] - parent chunk array
  • Q&A Mode: {x} Array[QAChunk] - Q&A chunk array

Index Method & Retrieval Setting

The index method determines how your knowledge base builds content indexes, while retrieval settings provide the corresponding retrieval strategies for the selected index method. Think of it this way: the index method determines how your documents are organized, while retrieval settings determine what methods users can use to find them. The knowledge base provides two index methods, High Quality and Economy, each offering different retrieval setting options. High Quality mode uses embedding models to convert segmented text chunks into numerical vectors, compressing and storing large amounts of text information more effectively. This enables the system to find semantically relevant, accurate answers even when the user’s question wording doesn’t exactly match the document. Economy mode uses 10 keywords per chunk for retrieval without calling embedding models, so it incurs no extra cost.
Please refer to Select the Indexing Method and Retrieval Setting for more details.
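To build intuition for the trade-off, the toy sketch below contrasts keyword lookup over an inverted index (Economy-style) with similarity scoring over vectors (High Quality-style). Real deployments use a learned embedding model and a vector store rather than the bag-of-words stand-in used here.

```python
# Toy contrast between Economy-style keyword retrieval and High-Quality-style
# vector retrieval. A real setup uses an embedding model and a vector store.
import math
from collections import Counter

chunks = ["Reset your password in account settings.",
          "Invoices are emailed at the end of each month."]

# Economy: an inverted index mapping keyword -> chunk ids.
inverted: dict[str, set[int]] = {}
for i, c in enumerate(chunks):
    for word in set(c.lower().rstrip(".").split()):
        inverted.setdefault(word, set()).add(i)

def keyword_search(query: str) -> set[int]:
    return set().union(*(inverted.get(w, set()) for w in query.lower().split()))

# High Quality: cosine similarity between "embeddings" (bag-of-words stand-in).
def embed(text: str) -> Counter:
    return Counter(text.lower().rstrip(".").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "how do I change my password"
print(keyword_search(query))                           # matches only on exact word overlap
scores = [cosine(embed(query), embed(c)) for c in chunks]
print(scores.index(max(scores)))                       # index of the most similar chunk
```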

Index Methods and Retrieval Settings

| Index Method | Available Retrieval Settings | Description |
| --- | --- | --- |
| High Quality | Vector Retrieval | Understands the deeper meaning of queries based on semantic similarity |
| High Quality | Full-text Retrieval | Keyword-based retrieval providing comprehensive search capabilities |
| High Quality | Hybrid Retrieval | Combines both semantic and keyword matching |
| Economy | Inverted Index | Common search-engine retrieval method that matches queries with key content |
You can also refer to the table below for information on configuring chunk structure, indexing methods, parameters, and retrieval settings.
| Chunk Structure | Index Methods | Parameters | Retrieval Settings |
| --- | --- | --- | --- |
| General Mode | High Quality | Embedding Model | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval |
| General Mode | Economy | Number of Keywords | Inverted Index |
| Parent-child Mode | High Quality only | Embedding Model | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval |
| Q&A Mode | High Quality only | Embedding Model | Vector Retrieval, Full-text Retrieval, Hybrid Retrieval |

Step 4: Create User Input Form

User input forms collect the initial information your pipeline needs to run effectively. Similar to the start node in a workflow, this form gathers necessary details from users, such as files to upload or specific parameters for document processing, ensuring your pipeline has all the information it needs to deliver accurate results. This way, you can create specialized input forms for different use scenarios, improving pipeline flexibility and usability across data sources and document processing steps.

Create User Input Form

There are two ways to create user input fields:
  1. Pipeline Orchestration Interface
    Click on the Input Field to start creating and configuring input forms.
  2. Node Parameter Panel
    Select a node. Then, in the parameter input on the right-side panel, click + Create user input to add new input items. New input items will also be collected in the Input Field.

Add User Input Fields

Unique Inputs for Each Entrance

These inputs are specific to each data source and its downstream nodes. Users only need to fill out these fields when selecting the corresponding data source, such as different URLs for different data sources. How to create: Click the + button on the right side of a data source to add fields for that specific data source. These fields can only be referenced by that data source and its subsequently connected nodes.

Global Inputs for All Entrances

Global shared inputs can be referenced by all nodes. These inputs are suitable for universal processing parameters, such as delimiters, maximum chunk length, document processing configurations, etc. Users need to fill out these fields regardless of which data source they choose. How to create: Click the + button on the right side of Global Inputs to add fields that can be referenced by any node.

Supported Input Field Types

The knowledge pipeline supports seven types of input variables:
| Field Type | Description |
| --- | --- |
| Text | Short text input by knowledge base users, maximum length 256 characters |
| Paragraph | Long text input for longer character strings |
| Select | Fixed options preset by the orchestrator for users to choose from; users cannot add custom content |
| Boolean | Only true/false values |
| Number | Only accepts numerical input |
| Single File | Upload a single file; supports multiple file types (documents, images, audio, and other file types) |
| File List | Batch file upload; supports multiple file types (documents, images, audio, and other file types) |
For more information about supported field types, please refer to the Input Fields documentation.

Field Configuration Options

All input field types include: required, optional, and additional settings. You can set whether fields are required by checking the appropriate option.
| Setting | Name | Description | Example |
| --- | --- | --- | --- |
| Required Settings | Variable Name | Internal system identifier, usually named using English and underscores | user_email |
| Required Settings | Display Name | Interface display name, usually concise and readable text | User Email |
| Type-specific Settings | | Special requirements for different field types | Text field with a maximum length of 100 characters |
| Additional Settings | Default Value | Default value when the user hasn’t provided input | Number field defaults to 0, text field defaults to empty |
| Additional Settings | Placeholder | Hint text displayed when the input box is empty | “Please enter your email” |
| Additional Settings | Tooltip | Explanatory text to guide user input, usually displayed on mouse hover | “Please enter a valid email address” |
| Special Optional Settings | | Additional setting options based on different field types | Email format validation |
After completing configuration, click the preview button in the upper right corner to browse the form preview interface. You can drag and adjust field groupings. If an exclamation mark appears, it indicates that the reference is invalid after moving.

Step 5: Name the Knowledge Base

By default, the knowledge base name will be “Untitled + number”, permissions are set to “Only me”, and the icon will be an orange book. If you import from a DSL file, the saved icon is used. Edit the knowledge base information by clicking Settings in the left panel and filling in the details below:
  • Name & Icon
    Pick a name for your knowledge base.
    Choose an emoji, upload an image, or paste an image URL as the icon of this knowledge base.
  • Knowledge Description
    Provide a brief description of your knowledge base. This helps the AI better understand and retrieve your data. If left empty, Dify will apply the default retrieval strategy.
  • Permissions
    Select the appropriate access permissions from the dropdown menu.

Step 6: Testing

You’re almost there! This is the final step of knowledge pipeline orchestration. After completing the orchestration, validate all the configuration first, then run tests to confirm the settings, and finally publish the knowledge pipeline.

Configuration Completeness Check

Before testing, it’s recommended to check the completeness of your configuration to avoid test failures due to missing configurations. Click the checklist button in the upper right corner, and the system will display any missing parts. After completing all configurations, you can preview the knowledge base pipeline’s operation through test runs, confirm that all settings are accurate, and then proceed with publishing.

Test Run

  1. Start Test: Click the “Test Run” button in the upper right corner
  2. Import Test File: Import files in the data source window that pops up on the right
Important Note: For better debugging and observation, only one file upload is allowed per test run.
  3. Fill Parameters: After successful import, fill in the corresponding parameters according to the user input form you configured earlier
  4. Start Test Run: Click Next Step to start testing the entire pipeline
During testing, you can access History Logs (track all run records with timestamps, execution status, and input/output summaries) and Variable Inspector (a dashboard at the bottom showing input/output data for each node to help identify issues and verify data flow) for efficient troubleshooting and error fixing.