Found 44,162 repositories (showing 30)
ibis-project
the portable Python dataframe library
microsoft
Simple and Distributed Machine Learning
JohnSnowLabs
State of the Art Natural Language Processing
apache
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
AlexIoannides
Implementing best practices for PySpark ETL jobs and applications.
uber
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
awesome-spark
A curated list of awesome Apache Spark packages and resources.
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
jadianes
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
narwhals-dev
Lightweight and extensible compatibility layer between dataframe libraries!
hi-primus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
jupyter-incubator
Jupyter magics and kernels for working with remote Spark clusters
spark-examples
PySpark RDD, DataFrame and Dataset examples in Python
logicalclocks
Hopsworks - Data-Intensive AI platform with a Feature Store
lakehq
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
mahmoudparsian
PySpark-Tutorial provides basic algorithms using PySpark
palantir
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
kavgan
Starter code to solve real-world text data problems. Includes: Gensim Word2Vec, phrase embeddings, text classification with logistic regression, word count with PySpark, simple text preprocessing, pre-trained embeddings and more.
graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
lensacom
PySpark + Scikit-learn = Sparkit-learn
mahmoudparsian
MapReduce, Spark, Java, and Scala for Data Algorithms Book
h2oai
Sparkling Water provides H2O functionality inside Spark cluster
pyspark-ai
English SDK for Apache Spark
lyhue1991
pyspark🍒🥭 is delicious, just eat it!😋😋
HariSekhon
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
WeBankFinTech
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
kuwala-io
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times.
MrPowers
PySpark test helper methods with beautiful error messages
ankurchavda
A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.
mrpowers-io
pyspark methods to enhance developer productivity 📣 👯 🎉