Modularity
Kedro
Setting up Kedro in a Conda Environment
conda create --name <ENV_NAME>
conda activate <ENV_NAME>
pip install kedro # not via conda
kedro new
cd <PROJECT_DIR>
kedro build-reqs
Kedro - 6 Months In
Source: https://lou.dev/blog/2020/kedro/
Introduction to Kedro training with Joel Schwarzmann
Source: Introduction to Kedro training with Joel Schwarzmann - YouTube
Kedro - reading/writing from/to S3
- Include `s3fs>=0.3.0,<0.5` in `src/requirements.txt`.
- Run `kedro build-reqs` from the project's root directory.
- Run `pip install -r src/requirements.txt`.
- Create a dataset in `conf/base/catalog.yml` that points to the location in S3. Add your AWS credentials to `conf/local/credentials.yml` and specify the credentials parameter along with the dataset parameter in `conf/base/catalog.yml`. Refer to: https://kedro.readthedocs.io/en/stable/data/data_catalog.html#example-4-loads-a-csv-file-from-a-specific-s3-bucket-using-credentials-and-load-arguments
- Now, when you run `catalog.load(DATASET_NAME)`, Kedro loads the data from S3 into memory.
- To save the output of a node as a dataset to S3, specify the name of the output (dataset), the S3 path, and the credentials in `conf/base/catalog.yml`.
- When the node is run, Kedro writes the data to S3.
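The link between the catalog entry and the credentials file can be sketched with plain Python dictionaries that mirror the two YAML files. Everything named here is a placeholder: the dataset name `my_dataset`, the bucket path, the credentials entry `dev_s3`, and the helper `resolve_credentials` are hypothetical. The `key`/`secret` field names are an assumption based on what s3fs accepts, so check the Kedro documentation linked above for the layout your version expects. The helper only illustrates the name lookup; it is not Kedro's own code.

```python
import copy

# Mirrors conf/base/catalog.yml. All names and paths are placeholders.
catalog_config = {
    "my_dataset": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://my-bucket/data/my_dataset.csv",
        "credentials": "dev_s3",  # name of an entry in credentials.yml
    }
}

# Mirrors conf/local/credentials.yml. Never commit real values.
credentials_config = {
    "dev_s3": {
        "key": "<AWS_ACCESS_KEY_ID>",
        "secret": "<AWS_SECRET_ACCESS_KEY>",
    }
}


def resolve_credentials(catalog, credentials):
    """Return a copy of `catalog` with credential names replaced by their values."""
    resolved = copy.deepcopy(catalog)
    for name, entry in resolved.items():
        cred_name = entry.get("credentials")
        if cred_name is None:
            continue  # this dataset needs no credentials
        if cred_name not in credentials:
            raise KeyError(f"No credentials named {cred_name!r} for dataset {name!r}")
        entry["credentials"] = credentials[cred_name]
    return resolved


resolved = resolve_credentials(catalog_config, credentials_config)
print(resolved["my_dataset"]["credentials"]["key"])
```

With Kedro installed, dictionaries of this shape should also be accepted by `DataCatalog.from_config(catalog_config, credentials_config)`, which performs the same kind of lookup before creating the dataset.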
Kedro - Hello World Pipeline
conda create -n <ENV_NAME>
conda activate <ENV_NAME>
pip install kedro
kedro new
(then, enter project name, etc.)
cd <KEDRO_PROJECT_DIRECTORY>
kedro build-reqs
pip install -r src/requirements.txt
kedro pipeline create greeting
- Add the following to `conf/base/parameters/greeting.yml`:

      first_name: John
      last_name: Doe

cd src/<PROJECT_NAME>/pipelines/greeting
- Add the following to `nodes.py`:

      def hello(first_name, last_name):
          greeting = f"Hello {first_name} {last_name}"
          print(greeting)
          return greeting

- Add the following to `pipeline.py`:

```Python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import hello


def create_pipeline(**kwargs) -> Pipeline:
    # Single-node pipeline: both inputs are read from the parameters file.
    return pipeline(
        [
            node(
                func=hello,
                inputs=["params:first_name", "params:last_name"],
                outputs="",
                name="hello",
            )
        ]
    )
```
cd ../../../..
(to get to the Kedro project's root directory)
kedro run --pipeline greeting
(can also run `kedro run` instead, because we only have 1 pipeline here)
- Among the log messages, Kedro should also print `Hello John Doe` to the screen.
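Because a Kedro node wraps an ordinary Python function, `hello` can be checked on its own before running the pipeline. A minimal sketch, assuming the parameter values from `greeting.yml`; the `parameters` dictionary is a hypothetical stand-in for what Kedro reads from that file.

```python
def hello(first_name, last_name):
    greeting = f"Hello {first_name} {last_name}"
    print(greeting)
    return greeting


# Stand-in for the contents of conf/base/parameters/greeting.yml.
parameters = {"first_name": "John", "last_name": "Doe"}

# Call the node function directly, the way the node would with its two inputs.
result = hello(parameters["first_name"], parameters["last_name"])
assert result == "Hello John Doe"
```

If this check passes but `kedro run` does not print the greeting, the problem is more likely in the parameters file or the pipeline definition than in the function itself.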
Things to note:
- The `outputs` argument is mandatory when creating a node.
  - If you don't want to save the return value of the node to a variable, specify `outputs=''`.
- Saving `None` to a `DataSet` is not allowed.
  - So, the function within the node should always return something.
- By default, Kedro takes the inputs to the node from the Data Catalog. When we specify `params:<PARAMETER>`, it reads from the parameter file of the pipeline it was called from, which in this case is `greeting`.
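
The difference between catalog inputs and `params:` inputs can be illustrated without Kedro. This is a simplified sketch, not Kedro's implementation: `resolve_inputs`, `catalog_data`, and the `salutation` dataset are hypothetical names, and the real framework resolves names through its own catalog machinery rather than plain dictionaries.

```python
PARAMS_PREFIX = "params:"


def resolve_inputs(input_names, catalog_data, parameters):
    """Map each node input name to a value.

    Names starting with "params:" are read from `parameters`;
    every other name is treated as a dataset in `catalog_data`.
    """
    values = []
    for name in input_names:
        if name.startswith(PARAMS_PREFIX):
            key = name[len(PARAMS_PREFIX):]
            values.append(parameters[key])
        else:
            values.append(catalog_data[name])
    return values


parameters = {"first_name": "John", "last_name": "Doe"}  # as in greeting.yml
catalog_data = {"salutation": "Hello"}  # hypothetical dataset

args = resolve_inputs(
    ["salutation", "params:first_name", "params:last_name"],
    catalog_data,
    parameters,
)
print(" ".join(args))  # Hello John Doe
```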