Found 1,653 repositories (showing 30)
whylabs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
florencejt
A Python package housing a collection of deep-learning multi-modal data fusion method pipelines! From data loading, to training, to evaluation - fusilli's got you covered 🌸
chinedufn
A collection of tools, data structures and methods for exporting Blender data (such as meshes and armatures) and preparing it for your rendering pipeline.
Welcome to the Multiverse of Data Science — a comprehensive, ever-expanding collection of over 100 real-world projects covering the entire data science pipeline!
merenlab
A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).
ethpandaops
Ethereum network monitoring with collection clients and a centralized server for data pipelining.
jboulon
Honu is a large-scale data collection and processing pipeline
briannlongzhao
A data collection and processing pipeline for animal video, annotations include mask, keypoint, depth, occlusion, etc. Suitable for 3D/4D reconstruction, tracking, pose prediction, etc.
institutional
The Institutional Data Initiative's pipeline for analyzing, refining, and publishing the Institutional Books 1.0 collection.
MahtaFetrat
ManaTTS is the largest open Persian speech dataset with 114+ hours of transcribed audio. Includes data collection pipeline and tools. Suitable for Persian text-to-speech models.
GaTech-RL2
MimicLabs: A Scalable Data Collection & Generation Pipeline for Table-top Manipulation
KnudsenMorten
AzLogDcrIngestPS - Unleashing the power of Log Ingestion API with Azure LogAnalytics custom table v2, Azure Data Collection Rules and Azure Data Ingestion Pipeline
GPT-Laboratory
Code for building specialized RAG systems using PDF documents with OpenAI Assistant API for GPT and LLaMA models, covering the full pipeline from data collection to generation.
avinash201199
A curated collection of essential resources, libraries, tools, courses, and playbooks to help you master Data Science, from exploratory analysis to production ML pipelines and everything in between.
physical-superintelligence-lab
Teleoperation and data collection pipeline for humanoid robots
Project-CETI
Source code for the data pipeline that starts by ingesting data from the embedded data collection devices (whale tags, moorings, etc.), uploads it to the cloud, and combines it into a dataset consumable by the machine learning pipelines.
NERC-CEH
Collection of data science methods/pipelines to support the UK's national capability in delivering world-leading environmental science.
Gowthambalan
A distributed, low-code, end-to-end data collection and analysis tool for data folks. Take the pain out of data collection from your pipeline!
anuran-roy
A distributed, low-code, end-to-end data collection and analysis tool for data folks. Take the pain out of data collection from your pipeline!
richban
Open Data Stack Platform: a collection of projects and pipelines built with open data stack tools for a scalable, observable data platform.
ultranet1
Project Description: A music streaming company wants to introduce more automation and monitoring to their data warehouse ETL pipelines, and they have concluded that the best tool to achieve this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and Data Scientists rely on the output generated by this pipeline, and it is expected to run daily on a schedule, pulling new data from the source and storing the results at the destination.

Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline Design: At a high level, the pipeline performs the following tasks:
- Extract data from multiple S3 locations.
- Load the data into a Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[Image: Structure of the Airflow DAG]

Design Goals: Based on the requirements of our data consumers, the pipeline must adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, the task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, report generation, and daily model retraining. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. Airflow ships with simple built-in operators such as PythonOperator, BashOperator, and DummyOperator; however, it also lets you extend BaseOperator to create custom operators. For this project, I developed several custom operators, described below (an illustrative sketch of one such operator follows the list):
- StageToRedshiftOperator: Stages data to a specific Redshift cluster from a specified S3 location. Uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: Loads data into the given fact table by running the provided SQL statement. Supports delete-insert and append-style loads.
- LoadDimensionOperator: Loads data into the given dimension table by running the provided SQL statement. Supports delete-insert and append-style loads.
- SubDagOperator: Groups two or more operators into one task. Here, it groups the check that a given table has rows with a series of data quality SQL commands.
- HasRowsOperator: Data quality check to ensure that the specified table has rows.
- DataQualityOperator: Performs data quality checks by running SQL statements to validate the data.
- SongPopularityOperator: Calculates the top ten most popular songs for a given interval. The interval is dictated by the DAG schedule.
- UnloadToS3Operator: Stores the analysis result back to the given S3 location.
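To illustrate how a custom operator extends BaseOperator, here is a minimal sketch in the spirit of the HasRowsOperator check. The connection-id parameter, hook usage, and error handling are assumptions made for illustration, not the project's actual implementation (which, as noted next, lives in plugins/operators).

    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class HasRowsOperator(BaseOperator):
        """Fail the task if the target table returns zero rows (illustrative sketch)."""

        @apply_defaults
        def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
            super(HasRowsOperator, self).__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id  # Airflow connection id for Redshift
            self.table = table                        # table to validate

        def execute(self, context):
            # Redshift speaks the Postgres protocol, so a PostgresHook can run the check.
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(self.table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} returned no rows".format(self.table))
            self.log.info("Data quality check on {} passed with {} records".format(self.table, records[0][0]))

StageToRedshiftOperator, LoadFactOperator, and the other operators listed above follow the same pattern: constructor parameters capture connection ids, tables, and SQL, and execute() performs the work through the corresponding hook.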
Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline. Please note that this is a trivial analysis, but you can imagine other, more complex queries that follow a similar structure.

S3 input events data: s3://<bucket>/log_data/2018/11/
    2018-11-01-events.json
    2018-11-02-events.json
    2018-11-03-events.json
    ..
    2018-11-28-events.json
    2018-11-29-events.json
    2018-11-30-events.json

S3 output song popularity data: s3://skuchkula-topsongs/
    songpopularity_2018-11-01
    songpopularity_2018-11-02
    songpopularity_2018-11-03
    ...
    songpopularity_2018-11-28
    songpopularity_2018-11-29
    songpopularity_2018-11-30

The DAG can be configured by giving it default_args, which specify the start_date, end_date, and the other design choices mentioned above (a minimal sketch of a DAG wired up with these arguments appears at the end of this section):

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }

How to run this project?

Step 1: Create an AWS Redshift cluster using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the cluster and make a note of:
    DWH_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
    DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml, and ensure that you have mapped the volume to point to the location of your DAGs. NOTE: Details on managing Apache Airflow on a Mac can be found here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

Step 3: Configure the Apache Airflow hooks. On the left is the S3 connection; the Login and Password are the IAM user's access key and secret key that you created, which allow the pipeline to read data from S3. On the right is the Redshift connection; these values can be easily gathered from your Redshift cluster connections.

Step 4: Execute the create-tables-dag. This DAG creates the staging, fact, and dimension tables. We trigger it manually because we want to keep table creation out of the main DAG. Normally, creating tables can be handled by simply running a script, but for the sake of illustration I created a DAG for it and had Airflow trigger it. You can turn the DAG off once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. Since the execution start date is 2018-11-01 with a schedule interval of @daily and the execution end date is 2018-11-30, Airflow will automatically trigger and schedule the DAG runs once per day, 30 times in total. Shown below are the 30 DAG runs, ranging from start_date to end_date, triggered by Airflow once per day.

[Image: schedule of the 30 daily DAG runs]
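As referenced above, here is a minimal sketch of how the daily DAG might be wired together with these default_args. The task names, the dependency chain, and the scheduling comments are illustrative assumptions rather than the project's actual code.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # Design goals from above: no dependence on past runs, 3 retries every
    # 5 minutes, no email on retry.
    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    }

    # Note: catchup is controlled by the DAG-level catchup argument (or the
    # catchup_by_default setting in airflow.cfg); placing 'catchup_by_default'
    # inside default_args has no effect on scheduling.
    dag = DAG(
        'load_and_transform_data_in_redshift',
        default_args=default_args,
        schedule_interval='@daily',
    )

    start = DummyOperator(task_id='begin_execution', dag=dag)
    end = DummyOperator(task_id='stop_execution', dag=dag)

    # The real tasks (StageToRedshiftOperator, LoadFactOperator,
    # LoadDimensionOperator, data quality checks, SongPopularityOperator,
    # UnloadToS3Operator) would be chained between these markers, e.g.:
    #   start >> stage_events >> load_fact >> quality_checks >> top_songs >> unload >> end
    start >> end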
sahilrajpurkar03
A complete imitation learning pipeline for bar alignment using the UR5 robot in NVIDIA Isaac Sim. Includes manual data collection with a game controller, dataset organization for LeRobot, diffusion policy training, and policy deployment through ROS2.
microsoft
Data collection pipeline for GitHub
a43992899
Open, royalty free, lyrics2song / song generation data collection / cleaning pipeline.
LCR-BCCRC
Collection of standard analytical pipelines for genomic and transcriptomic data
claudiocmm
A collection of Data Engineering projects using different cloud providers. Explore real-world implementations of data pipelines, transformations, and workflows in cloud environments.
wellcomecollection
🛢️ The data pipeline services extracting & transforming data from our museum and collections.
harrodyuan
Kalshi Data Collection Pipeline
devsecops
Forecast is a big data environment for understanding security anomalies as they are presented in a project, and is meant to aid in the collection of data for the end-to-end CI/CD pipeline.
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs