Advanced pipeline
An advanced pipeline is a special type of pipeline that lets you combine multiple operators and modeling pipelines to create complex data transformation workflows.
Operators are the fundamental units used by the Recurve orchestrator to perform tasks. In a modeling pipeline, each model is an operator that transforms data by running SQL queries. In an advanced pipeline, however, you can include other types of operators capable of handling broader tasks, such as:
Ingesting data from a source to a target data warehouse.
Running custom SQL queries for ad-hoc transformations.
Sending notifications at various stages of job execution.
Follow these steps to create an advanced pipeline:
Navigate to Data Design > Pipelines in your Recurve project.
Click the + icon and select Create advanced pipeline.
Enter a name for your pipeline and click Confirm.
An advanced pipeline is created. You’ll now define its nodes and their relationships within the DAG (directed acyclic graph).
Click + Add node. Choose an operator or an existing modeling pipeline.
Each operator has different configuration requirements. See [documentation link] for details.
Define the relationships (execution order) between nodes by clicking and dragging arrows between them.
The SQL Operator executes custom queries directly against your data sources. While SQL models focus on transformation logic and dependencies, the SQL Operator is ideal for cleaning and preparing data before it enters your modeling pipeline.
Configuration:
Data Source: the target data source. Choose from the destinations you've configured in your Recurve organization.
Database: the specific database in which to run the query.
SQL Query: the SQL script to run; you can include multiple query statements.
For example, you can use it to clean source tables before transforming them in downstream SQL models.
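As an illustration, the SQL Query field might contain a multi-statement cleanup script like the sketch below. The schema and table names (raw.orders) and column names are hypothetical; adapt them to your own source.

```sql
-- Hypothetical cleanup script for the SQL Query field.
-- Multiple statements run in order against the selected database.

-- Remove obviously invalid rows from the raw table.
DELETE FROM raw.orders
WHERE order_id IS NULL
   OR order_total < 0;

-- Normalize text columns so downstream SQL models can join on them reliably.
UPDATE raw.orders
SET customer_email = LOWER(TRIM(customer_email)),
    country_code   = UPPER(country_code);
```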
The Transfer Operator allows you to move data from one location to another. For example, you can use it to transfer data from BigQuery to Postgres or from Amazon S3 to Google Cloud Storage.
The operator consists of two tasks: Dump and Load. The Dump task fetches data from a source, while the Load task transfers it to a destination. The configuration, authentication, and parameters vary depending on the source and destination.
For example, we use the Transfer Operator to move data from Google Sheets into our Postgres database, then process the data in a modeling pipeline.
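The Dump and Load tasks are configured in the Recurve UI, but conceptually they behave like the following Python sketch, which reads a Google Sheets worksheet exported as CSV and writes it to a Postgres staging table. The sheet ID, connection string, and table name are placeholders; the actual operator handles authentication and source- or destination-specific parameters for you.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sheet ID and Postgres credentials; replace with your own.
SHEET_ID = "your-google-sheet-id"
CSV_URL = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv&gid=0"
PG_URL = "postgresql://user:password@localhost:5432/analytics"

# Dump: fetch the worksheet as a DataFrame.
df = pd.read_csv(CSV_URL)

# Load: write the rows into a staging table that a modeling pipeline can pick up.
engine = create_engine(PG_URL)
df.to_sql("stg_google_sheet_orders", engine, schema="public",
          if_exists="replace", index=False)
```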
The Python Operator allows you to integrate Python code into data pipelines. This operator is particularly useful in scenarios such as:
Data Processing: Performing transformations or calculations on datasets.
API Interactions: Making API calls to external services or databases.
Custom Logic Implementation: Executing business logic that may not fit into standard operators like Bash or SQL operators.
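For instance, a Python Operator script might call an external API and apply custom business logic before handing the results to downstream tasks. This is a minimal sketch: the API endpoint and threshold are made up, and how the operator passes inputs and outputs depends on your Recurve configuration.

```python
import json
import requests

# Hypothetical API endpoint; swap in the service your pipeline depends on.
API_URL = "https://api.example.com/v1/exchange-rates?base=USD"

def fetch_and_filter(min_rate: float = 1.0) -> list[dict]:
    """Call the API and keep only rates above a threshold (custom business logic)."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    rates = response.json().get("rates", {})
    return [
        {"currency": currency, "rate": rate}
        for currency, rate in rates.items()
        if rate >= min_rate
    ]

if __name__ == "__main__":
    # Print as JSON so a downstream task (or the job log) can consume the result.
    print(json.dumps(fetch_and_filter(), indent=2))
```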