Modularity

Kedro

Setting up Kedro in a Conda Environment

conda create --name <ENV_NAME>
conda activate <ENV_NAME>
pip install kedro # not via conda
kedro new
cd <PROJECT_DIR>
kedro build-reqs

Kedro - 6 Months In

Source: https://lou.dev/blog/2020/kedro/

Introduction to Kedro training with Joel Schwarzmann

Source: Introduction to Kedro training with Joel Schwarzmann - YouTube

Kedro - reading/writing from/to S3

  • Include s3fs>=0.3.0,<0.5 in src/requirements.txt
  • Run kedro build-reqs from the project’s root directory
  • Run pip install -r src/requirements.txt
  • Create a dataset entry in conf/base/catalog.yml whose filepath points to the location in S3. Add your AWS credentials to conf/local/credentials.yml and reference them from the dataset entry through its credentials key. Refer to: https://kedro.readthedocs.io/en/stable/data/data_catalog.html#example-4-loads-a-csv-file-from-a-specific-s3-bucket-using-credentials-and-load-arguments
  • Now, when you run catalog.load(DATASET_NAME), Kedro loads the data from S3 into memory
  • To save the output of a node as a dataset in S3, add an entry to conf/base/catalog.yml that specifies the name of the output (the dataset), its S3 path, and the credentials.
  • When the node is run, Kedro writes the data to S3.
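
The catalog and credentials entries described above might look roughly like the following, written here as Python dictionaries that mirror the two YAML files. This is a minimal sketch and not taken from these notes: the dataset name motorbikes, the bucket path, the credentials name dev_s3, and the placeholder values are all hypothetical, and the exact credential key names depend on the Kedro and s3fs versions in use. The helper resolve_credentials only illustrates how a dataset's credentials key refers to an entry in credentials.yml; it is not Kedro's implementation.

```python
import copy

# Mirrors an entry in conf/base/catalog.yml (dataset name and path are hypothetical).
catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://your_bucket/data/motorbikes.csv",
        "credentials": "dev_s3",  # name of an entry in credentials.yml
        "load_args": {"sep": ","},
    }
}

# Mirrors conf/local/credentials.yml (placeholder values; key names are an assumption).
credentials_config = {
    "dev_s3": {"key": "<AWS_ACCESS_KEY_ID>", "secret": "<AWS_SECRET_ACCESS_KEY>"}
}


def resolve_credentials(catalog, credentials):
    """Return a copy of the catalog with credential names replaced by their values."""
    resolved = copy.deepcopy(catalog)
    for name, entry in resolved.items():
        ref = entry.get("credentials")
        if isinstance(ref, str):
            if ref not in credentials:
                raise KeyError(f"Dataset '{name}' refers to unknown credentials '{ref}'")
            entry["credentials"] = credentials[ref]
    return resolved


resolved_catalog = resolve_credentials(catalog_config, credentials_config)
print(sorted(resolved_catalog["motorbikes"]["credentials"]))
```

Keeping the secrets in conf/local and only their name in conf/base means the catalog itself can be committed without exposing the AWS keys.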

Kedro - Hello World Pipeline

  • conda create -n <ENV_NAME>
  • conda activate <ENV_NAME>
  • pip install kedro
  • kedro new (then, enter project name, etc.)
  • cd <KEDRO_PROJECT_DIRECTORY>
  • kedro build-reqs
  • pip install -r src/requirements.txt
  • kedro pipeline create greeting
  • Add the following to conf/base/parameters/greeting.yml:
    first_name: John
    last_name: Doe
    
  • cd src/<PROJECT_NAME>/pipelines/greeting
  • Add the following to nodes.py:
    def hello(first_name, last_name):
      greeting = f"Hello {first_name} {last_name}"
      print(greeting)
      return greeting
    
  • Add the following to pipeline.py:
    from kedro.pipeline import Pipeline, node, pipeline
    from .nodes import hello

    def create_pipeline(**kwargs) -> Pipeline:
        return pipeline([
            node(
                func=hello,
                inputs=["params:first_name", "params:last_name"],
                outputs="",
                name="hello",
            )
        ])

  • cd ../../../.. to get to the Kedro project’s root directory
  • kedro run --pipeline greeting (a plain kedro run also works here because the project has only one pipeline)
  • Among the log messages, Kedro should also print Hello John Doe to the screen.
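
Because a node wraps an ordinary Python function, the function can also be checked on its own before running the pipeline. The sketch below repeats the hello function from nodes.py and calls it directly with the values from greeting.yml; running it as a standalone script is an assumption made for illustration and is not part of the Kedro project.

```python
def hello(first_name, last_name):
    # Same logic as the node function in nodes.py.
    greeting = f"Hello {first_name} {last_name}"
    print(greeting)
    return greeting


# Call the function directly with the parameter values from greeting.yml.
result = hello("John", "Doe")
```

If the direct call behaves as expected, any remaining problem in kedro run is more likely to be in the pipeline wiring or the configuration than in the function itself.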

Things to note:

  • The outputs argument is mandatory when creating a node.
    • If you don’t want to save the return value of the node to a variable, specify outputs=''
  • Saving None to a DataSet is not allowed.
    • So, the function within the node should always return something.
  • By default, Kedro takes the inputs to a node from the Data Catalog. When an input is written as params:<PARAMETER>, Kedro looks the value up in the project's parameters instead, which here are defined in conf/base/parameters/greeting.yml.
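
The notes above can be pictured with a toy model in plain Python. This is a simplified sketch and not Kedro's implementation: the functions resolve_inputs and run_node, the output name greeting, and the dictionaries standing in for the Data Catalog and the parameters are hypothetical names introduced only for illustration.

```python
PARAMS_PREFIX = "params:"


def resolve_inputs(input_names, catalog, parameters):
    """Look up each input: `params:` names come from parameters, the rest from the catalog."""
    values = []
    for name in input_names:
        if name.startswith(PARAMS_PREFIX):
            values.append(parameters[name[len(PARAMS_PREFIX):]])
        else:
            values.append(catalog[name])
    return values


def run_node(func, input_names, output_name, catalog, parameters):
    """Call the function on its resolved inputs and store the result under output_name."""
    result = func(*resolve_inputs(input_names, catalog, parameters))
    if result is None:
        # Toy version of the rule that None cannot be saved to a dataset.
        raise ValueError("Saving None to a dataset is not allowed")
    catalog[output_name] = result
    return result


def hello(first_name, last_name):
    return f"Hello {first_name} {last_name}"


parameters = {"first_name": "John", "last_name": "Doe"}  # stands in for greeting.yml
catalog = {}  # stands in for the Data Catalog

run_node(hello, ["params:first_name", "params:last_name"], "greeting", catalog, parameters)
print(catalog["greeting"])
```

In this model, a function that returns nothing fails at the save step, which matches the note that the function within the node should return a value whenever an output is declared.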