Search Results

Found 6,045 repositories (showing 30)

databricks-sql-python

databricks

❤️47

Databricks SQL Connector for Python

226

140

Apache-2.0

Python

Updated 1 week ago

databricks dwh python3 +1

EComDWH-BatchDataProcessingPlatform

Smars-Bin-Hu

🧡60

This project aims to build an enterprise-grade offline data warehouse solution based on e-commerce platform order data.

167

MIT

Python

Updated 3 weeks ago

big-data data-warehouse hadoop +4

ETL_with_Python

dimgold

❤️46

ETL with Python - Taught at DWH course 2017 (TAU)

103

Jupyter Notebook

Updated 2 months ago

csv data-base data-science +10

mara-schema

mara

❤️25

Mapping of DWH database tables to business entities, attributes & metrics in Python, with automatic creation of flattened tables

MIT

Python

Updated 3 months ago

data-governance data-modeling datawarehousing +2

3DWheelPicker

yijiebuyi

❤️35

Android data picker with a 3D wheel scrolling effect; supports various time pickers, city selection, plain data selection, and cascading data selection

Java

Updated 3 months ago

3dwheelpicker android datapicker +1

dwh-migration-tools

google

❤️36

No description available

Apache-2.0

Java

Updated 1 week ago

databricks-sql-go

databricks

🧡56

Golang database/sql driver for Databricks SQL.

Apache-2.0

Updated 16 hours ago

databricks dwh golang +2

TraditionalModernDW

justBlindbaek

🧡55

Simple cloud only DWH solution architecture.

TSQL

Updated 3 weeks ago

architecture azure-data-factory azure-data-lake +2

dbt-core-interface is an MIT-licensed, high-level wrapper for dbt-core that can be used to drive third-party integrations such as servers, CI automation, DWH automation, etc. without duplicate boilerplate.

MIT

Python

Updated 2 months ago

dbt python

OLAP-cube

fibo

🧡55

is a hypercube of data

MIT

JavaScript

Updated 1 week ago

business-intelligence cube data-warehouse +8

databricks-sql-nodejs

databricks

🧡56

Databricks SQL Connector for Node.js

Apache-2.0

TypeScript

Updated 17 hours ago

databricks dwh node +3

sql-data-warehouse-project

KL-2300032590

💛70

Building a DWH with SQL, including ETL processes, modelling, and analysis

MIT

TSQL

Updated 2 hours ago

datapreprocessing datawarehouse etl-pipeline +1

dbt_airbyte_shopify_facebook_paypal_fedex_gls_ecommerce_profitability

Snowboard-Software

❤️45

This repository is a production dbt pipeline example that models the profitability of an e-commerce business. Data is extracted and loaded into a BigQuery DWH by Airbyte. Data sources include Shopify, Facebook Ads, PayPal, FedEx and GLS shipping data, and manufacturing costs.

Updated 1 month ago

DWH

D35m0nd142

❤️35

Simple (but working) WEP/WPA/WPA2 Hacking script

Python

Updated 3 months ago

Datawarehouse

Nathnael12

❤️40

Fully dockerized Data Warehouse (DWH) using Airflow, dbt, and PostgreSQL, with a dashboard built in Redash

MIT

Jupyter Notebook

Updated 3 months ago

airflow dataengineering datawarehouse +3

personal-swiss-finance-dw

ssp-data

🧡65

Personal finance project to automatically collect Swiss banking transactions into a DWH and visualise them

Python

Updated 2 days ago

Prescriber-ETL-data-pipeline

judeleonard

🧡50

An end-to-end ETL data pipeline that uses PySpark parallel processing to handle about 25 million rows of data from a SaaS application, with Apache Airflow as the orchestration tool, various data warehouse technologies, and Apache Superset connected to the DWH to generate BI dashboards for weekly reports

Apache-2.0

Python

Updated 2 months ago

airflow airflow-docker amazon-s3 +8

dwhwrapper

xlfe

❤️35

CLI wrapper for Teradata data warehouse utilities (BTEQ, etc.)

Python

Updated 1 year ago

BigDataInDepth

MoustafaAMahmoud

🧡55

Data Engineering Course

TeX

Updated 1 week ago

distributed-systems distrubted-systems dwh +5

hnhm

marchinho11

❤️30

Toolkit for Agile-driven data modeling and data loading using highly Normalized hybrid Model

Python

Updated 7 months ago

agile anchor-modeling data-modeling +3

APACHE_AIRFLOW_DATA_PIPELINES

ultranet1

❤️45

Project Description: A music streaming company wants to introduce more automation and monitoring to its data warehouse ETL pipelines and has concluded that the best tool for this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output of this pipeline, which is expected to run daily on a schedule, pulling new data from the source and storing the results in the destination.
Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.
Data Pipeline design: At a high level, the pipeline performs the following tasks:
  Extract data from multiple S3 locations.
  Load the data into the Redshift cluster.
  Transform the data into a star schema.
  Perform data validation and data quality checks.
  Calculate the most played songs for the specified time interval.
  Load the result back into S3.
[Figure: Structure of the Airflow DAG]
Design Goals: Based on the requirements of our data consumers, the pipeline must adhere to the following guidelines:
  The DAG should not have any dependencies on past runs.
  On failure, a task is retried 3 times.
  Retries happen every 5 minutes.
  Catchup is turned off.
  Do not email on retry.
Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
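As a minimal sketch of that idea outside Airflow, the six high-level tasks listed above can be modeled as a dependency graph and ordered with Python's standard-library graphlib (Python 3.9+). The task names and the strictly linear chain of dependencies are assumptions made for illustration; they are not taken from the project's actual DAG definition.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the six high-level steps. Each task maps to the
# set of tasks it depends on (assumed here to form a simple chain).
PIPELINE = {
    "extract_from_s3": set(),
    "load_into_redshift": {"extract_from_s3"},
    "transform_to_star_schema": {"load_into_redshift"},
    "run_data_quality_checks": {"transform_to_star_schema"},
    "calculate_song_popularity": {"run_data_quality_checks"},
    "unload_to_s3": {"calculate_song_popularity"},
}


def execution_order(graph):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A scheduler such as Airflow does considerably more (retries, scheduling, state tracking), but the ordering constraint it enforces between tasks is the same kind of dependency relation.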
A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow comes with some simple built-in operators such as PythonOperator, BashOperator, and DummyOperator; however, Airflow also lets you extend BaseOperator to create custom operators. For this project, I developed several custom operators.
[Figure: operators]
The description of each of these operators follows:
  StageToRedshiftOperator: Stages data to a specific Redshift cluster from a specified S3 location. The operator uses templated fields to handle partitioned S3 locations.
  LoadFactOperator: Loads data to the given fact table by running the provided SQL statement. Supports delete-insert and append style loads.
  LoadDimensionOperator: Loads data to the given dimension table by running the provided SQL statement. Supports delete-insert and append style loads.
  SubDagOperator: Two or more operators can be grouped into one task using the SubDagOperator. Here, I group the tasks of checking that the given table has rows and then running a series of data quality SQL commands.
  HasRowsOperator: Data quality check to ensure that the specified table has rows.
  DataQualityOperator: Performs data quality checks by running SQL statements to validate the data.
  SongPopularityOperator: Calculates the top ten most popular songs for a given interval. The interval is dictated by the DAG schedule.
  UnloadToS3Operator: Stores the analysis result back to the given S3 location.
Code for each of these operators is located in the plugins/operators directory.
Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline.
Please note, this is a trivial analysis, but you can imagine other complex queries that follow a similar structure.
S3 Input events data:
  s3://<bucket>/log_data/2018/11/
  2018-11-01-events.json
  2018-11-02-events.json
  2018-11-03-events.json
  ..
  2018-11-28-events.json
  2018-11-29-events.json
  2018-11-30-events.json
S3 Output song popularity data:
  s3://skuchkula-topsongs/
  songpopularity_2018-11-01
  songpopularity_2018-11-02
  songpopularity_2018-11-03
  ...
  songpopularity_2018-11-28
  songpopularity_2018-11-29
  songpopularity_2018-11-30
The DAG can be configured by giving it some default_args, which specify the start_date, end_date and the other design choices mentioned above.
    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }
How to run this project?
Step 1: Create an AWS Redshift Cluster using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the AWS Redshift Cluster. Make a note of:
  DWN_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
  DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole
Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml. Ensure that you have mapped the volume to point to the location where you have your DAGs. NOTE: You can find details of how to manage Apache Airflow on mac here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3
[Figure: start_airflow]
Step 3: Configure Apache Airflow Hooks. On the left is the S3 connection. The login and password are the access key and secret key of the IAM user that you created. With these credentials, we are able to read data from S3. On the right is the Redshift connection.
These values can be easily gathered from your Redshift cluster connections.
Step 4: Execute the create-tables-dag. This DAG creates the staging, fact and dimension tables. We trigger it manually because we want to keep it out of the main DAG. Normally, table creation can be handled by just triggering a script, but for the sake of illustration I created a DAG for this and had Airflow trigger it. You can turn off the DAG once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.
Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with a schedule interval of @daily and the execution end date is 2018-11-30, Airflow will automatically schedule and trigger the DAG runs once per day, 30 times in total. Shown below are the 30 DAG runs, ranging from start_date to end_date, that are triggered by Airflow once per day.
[Figure: schedule]
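The daily partitioning and the 30 scheduled runs described above can be sketched in plain Python: one input file and one output name per execution date between start_date and end_date. The helper names and the `my-bucket` placeholder are assumptions for illustration; only the path patterns are taken from the description above, and this is not the project's own code.

```python
from datetime import date, timedelta


def daily_runs(start, end):
    """Yield every execution date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def input_key(bucket, day):
    """Events file for one day, partitioned by year and month."""
    return f"s3://{bucket}/log_data/{day:%Y}/{day:%m}/{day:%Y-%m-%d}-events.json"


def output_key(bucket, day):
    """Song popularity result for one day."""
    return f"s3://{bucket}/songpopularity_{day:%Y-%m-%d}"


if __name__ == "__main__":
    # "my-bucket" stands in for the elided <bucket> name.
    for day in daily_runs(date(2018, 11, 1), date(2018, 11, 30)):
        print(input_key("my-bucket", day), "->", output_key("skuchkula-topsongs", day))
```

In the Airflow version, the same substitution would presumably be done by the templated fields mentioned for StageToRedshiftOperator, with the execution date supplied by the scheduler rather than by a loop.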

Python

Updated 1 month ago

postgres-dwh

sdw-online

❤️45

A Postgres data warehouse for processing synthetic data using IAC principles

Python

Updated 2 months ago

BioDWH2

❤️20

BioDWH2 is an easy-to-use, automated, graph-based data warehouse and mapping tool for bioinformatics and medical informatics.

MIT

Java

Updated 4 months ago

biodwh2 data-integration data-warehouse +6

dwh-mixpanel

ak--47

❤️40

🏭 reverse-ETL from your DWH to mixpanel

MIT

JavaScript

Updated 2 years ago

rivery_cli

RiveryIO

❤️20

Rivery CLI

Python

Updated 8 months ago

data-pipeline data-pipelines data-science +9

DWH-to-go

d-gilles

🧡50

No description available

Jupyter Notebook

Updated 2 days ago

neo4j-dwh-connector

neo4j-contrib

❤️20

No description available

Scala

Updated 3 months ago

hacktoberfest

watch

alx-sdv

❤️35

Oracle DB monitoring

MIT

Python

Updated 1 year ago

database dwh monitoring +2

snowflake-get-ddl-tools

dwh-dev

❤️40

Collection of Snowflake Scripting procedures extending GET_DDL function by dwh.dev.

MIT

PLpgSQL

Updated 11 months ago

nifi-postgres-metabase

mvrabel

❤️40

Template for creating batch-based ETL workflows for data warehouses

Apache-2.0

PLpgSQL

Updated 2 years ago

apache-nifi datawarehouse dwh +4
