Found 126,562 repositories (showing 30)
binhnguyennus
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
minimaxir
The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.
ClickHouse
ClickHouse® is a real-time analytics database management system
apache
Apache Spark - A unified analytics engine for large-scale data processing
donnemartin
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
apache
Apache Flink
thingsboard
Open-source IoT Platform - Device management, data collection, processing and visualization.
amark
An open source cybersecurity protocol for syncing decentralized graph data.
heibaiying
Getting Started Guide to Big Data :star:
prestodb
The official home of the Presto distributed SQL query engine for big data
andkret
The Data Engineering Cookbook
oxnr
A curated list of awesome big data frameworks, resources and other awesomeness.
trinodb
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
apache
PredictionIO, a machine learning server for developers and ML engineers.
vesoft-inc
A distributed, fast open-source graph database featuring horizontal scalability and high availability
provectus
Open-Source Web UI for Apache Kafka Management
yahoo
CMAK is a tool for managing Apache Kafka clusters
StarRocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
quickwit-oss
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
cython
The most widely used Python to C compiler
wangzhiwubigdata
Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
delta-io
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs
apache
Apache DataFusion SQL Query Engine
apache
Apache Beam is a unified programming model for Batch and Streaming data processing.
vaexio
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second
h2oai
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
arkime
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
feast-dev
The Open Source Feature Store for AI/ML
vespa-engine
AI + Data, online. https://vespa.ai