Data Engineering

The Analytics Setup Guidebook

https://www.holistics.io/books/setup-analytics/

How to connect to CDP Impala from python

https://community.cloudera.com/t5/Community-Articles/How-to-connect-to-CDP-Impala-from-python/ta-p/296405

Step 1: Set up Impala JDBC drivers

First, download the latest Impala JDBC drivers from the Cloudera JDBC Driver 2.6.17 for Impala download page.

Then, upload them to your machine.

Finally, make sure that you set up your CLASSPATH properly by opening a terminal session and typing the following:

CLASSPATH=.:/home/cdsw/ImpalaJDBC4.jar:/home/cdsw/ImpalaJDBC41.jar:/home/cdsw/ImpalaJDBC42.jar

export CLASSPATH

On Windows, you can add the CLASSPATH variable through “Edit the system environment variables” in Control Panel.
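The same CLASSPATH value can also be assembled from Python, which sidesteps the separator difference between Linux (`:`) and Windows (`;`). This is a minimal sketch, not part of the original article: `build_classpath` and `missing_jars` are hypothetical helper names, and the jar locations are the ones used in the export command above. Setting `os.environ` only affects the current Python process and anything it starts afterwards.

```python
import os
from pathlib import Path


def build_classpath(jar_paths, include_cwd=True):
    """Join jar paths into a CLASSPATH string using the platform separator."""
    entries = ["."] if include_cwd else []
    entries.extend(str(p) for p in jar_paths)
    # os.pathsep is ":" on Linux/macOS and ";" on Windows
    return os.pathsep.join(entries)


def missing_jars(jar_paths):
    """Return the jar paths that do not exist on disk."""
    return [str(p) for p in jar_paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Jar locations taken from the CLASSPATH line above; adjust to your machine
    jars = [
        "/home/cdsw/ImpalaJDBC4.jar",
        "/home/cdsw/ImpalaJDBC41.jar",
        "/home/cdsw/ImpalaJDBC42.jar",
    ]
    for jar in missing_jars(jars):
        print(f"warning: {jar} not found")
    os.environ["CLASSPATH"] = build_classpath(jars)
    print(os.environ["CLASSPATH"])
```

Checking for missing jars first tends to give a clearer message than the class-loading error that would otherwise surface at connection time.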

Step 2: Install JayDeBeApi

To install JayDeBeApi, run the following:

pip3 install JayDeBeApi

To avoid an error along the lines of "AttributeError: type object 'java.sql.Types' has no attribute '__javaclass__'", it is recommended to downgrade JPype1 by running the following:

pip3 install --upgrade jpype1==0.6.3 --user

Restart your kernel after you perform the downgrade.
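After restarting, it can help to confirm which versions the kernel actually sees. The sketch below assumes Python 3.8 or newer (for `importlib.metadata`); `installed_version` is a hypothetical helper name, and the package names are the ones used in the pip commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as passed to pip in the steps above
    for name in ("JayDeBeApi", "JPype1"):
        print(name, installed_version(name) or "not installed")
```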

Step 3: Connect to Impala

Finally, connect to Impala using the following sample code:

import jaydebeapi
conn = jaydebeapi.connect("com.cloudera.impala.jdbc.DataSource",
                          "jdbc:impala://[your_host]:443/;ssl=1;transportMode=http;httpPath=icml-data-mart/cdp-proxy-api/impala;AuthMech=3;",
                          {'UID': "[your_cdp_user]", 'PWD': "[your_workload_pwd]"},
                          '/home/cdsw/ImpalaJDBC41.jar')
curs = conn.cursor()

# Run a query and print the rows it returns
curs.execute("select * from default.locations")
print(curs.fetchall())

curs.close()
conn.close()
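The rows returned by fetchall() are plain tuples. Under the Python DB-API, the column names are available from `cursor.description`, so the two can be paired up. The sketch below is an illustration only: `rows_as_dicts` is a hypothetical helper, and `sqlite3` with a made-up `locations` table stands in for Impala so the example runs without a cluster. The same function should work on any DB-API cursor that populates `description`.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair each fetched row with the column names from cursor.description."""
    # The first item of each description entry is the column name (DB-API 2.0)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def demo():
    # sqlite3 is only a stand-in here; the table and its row are invented
    conn = sqlite3.connect(":memory:")
    curs = conn.cursor()
    curs.execute("create table locations (id integer, city text)")
    curs.execute("insert into locations values (1, 'Austin')")
    curs.execute("select * from locations")
    rows = rows_as_dicts(curs)
    curs.close()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```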

Note: You can get your Impala JDBC string either from the Data Hub endpoint path or from the JDBC URL in CDW.
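If the host and HTTP path come from different places, the JDBC string can be assembled from its parts. This is a rough sketch: `build_impala_jdbc_url` is a hypothetical helper, and the property names and values (ssl, transportMode, httpPath, AuthMech) are simply those that appear in the sample connection string, not a complete list of driver options.

```python
def build_impala_jdbc_url(host, http_path, port=443):
    """Assemble a JDBC URL shaped like the one in the sample code."""
    # Properties copied from the sample connection string; order is preserved
    props = {
        "ssl": "1",
        "transportMode": "http",
        "httpPath": http_path,
        "AuthMech": "3",
    }
    tail = "".join(f"{key}={value};" for key, value in props.items())
    return f"jdbc:impala://{host}:{port}/;{tail}"


if __name__ == "__main__":
    # Placeholders as in the sample code; replace with your own values
    print(build_impala_jdbc_url("[your_host]", "icml-data-mart/cdp-proxy-api/impala"))
```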


Kerberos authentication explained

https://www.varonis.com/blog/kerberos-authentication-explained

Authentication and Authorization

https://www.udacity.com/course/authentication-authorization-oauth--ud330
https://learn.microsoft.com/en-us/previous-versions/msp-n-p/ff647503(v=pandp.10)
https://www.bu.edu/tech/about/security-resources/bestpractice/auth/
https://solutionsreview.com/identity-management/top-9-authentication-books-for-professionals/

Change Data Capture

https://github.com/foogaro/change-data-capture
https://github.com/abrarsheikhsony/SFDC-change-data-capture

LinkedIn Posts

Generally, I get paid for improving students' learning experience. I realized YouTube ads are an exception to this, so I decided to turn off monetization on my entire channel.

For me, the students' learning experience matters most!

My YouTube channel has a complete playlist on SQL and many videos for Big Data enthusiasts.

Step-by-step approach to mastering Big Data (free resources)

Step 1 - Learn SQL

📌 Basics - https://lnkd.in/gdnhRk8b

📌 Advanced - https://lnkd.in/g8tyEKbU

📌 Leetcode - https://lnkd.in/gKeSMPmW

Learn Python basics -

📌 Python Tutorial : https://lnkd.in/gPBDBhpA

📌 Python for Beginners : https://lnkd.in/gHWyQfQX

Big Data Concepts -

📌 Big Data Fundamentals https://lnkd.in/fWZPWKP

📌 HDFS Architecture https://lnkd.in/fNP7bf7

📌 MapReduce Fundamentals https://lnkd.in/g457Wmv

📌 Hive tutorial for Beginners https://lnkd.in/gJpDMTfD

📌 Introduction to Apache Spark https://lnkd.in/gFRpe3-D

📌 Spark Accumulator & Shared Variables https://lnkd.in/geZQaV3Y

📌 Big Data on AWS Cloud https://lnkd.in/fBMf6Ac

📌 Big Data Project Use case https://lnkd.in/gFRpe3-D

Interview Questions -

📌 Partitioning vs bucketing https://lnkd.in/gmbiKf3r

📌 ORC vs Parquet file format https://lnkd.in/gM2Q8Egg

📌 Avro vs Parquet https://lnkd.in/gg-NcyNJ

📌 Avro vs ORC vs Parquet file https://lnkd.in/gizVx2Kw

📌 What is SerDe https://lnkd.in/gxDVFTQJ

📌 Row based vs Column based file formats https://lnkd.in/gN3vUsb6

📌 Spark Interview Question https://lnkd.in/fUD6skU

Big Data General Concepts -

📌 5 Tips to prepare for Big Data Interviews https://lnkd.in/gVcEjskn

📌 Scalability vs Availability & Low Latency vs High Latency https://lnkd.in/gFkXxKns

📌 What is the future of Big Data https://lnkd.in/gcJ6xSqW

📌 Difference between Database vs Data lake vs Warehouse https://lnkd.in/gS4kWruJ

Big Data Mock Interviews -

📌 9 Big Data Mock Interviews Playlist https://lnkd.in/g78x9KCa

📌 Link to subscribe to my YouTube channel https://lnkd.in/geJt-sMS

Pandas ETL

https://blog.devgenius.io/basic-etl-using-pandas-23729ae4e05e
https://pbpython.com/pandas-grouper-agg.html

How To Start Your Next Data Engineering Project

https://seattledataguy.substack.com/p/how-to-start-your-next-data-engineering