Found 59 repositories (showing 30)
chrismattmann
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
chrismattmann
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
nasa-jpl-memex
Interactive Image similarity and Visual Search and Retrieval application
tspannhw
Open Source Computer Vision with TensorFlow, MiniFi, Apache NiFi, OpenCV, Apache Tika and Python. For processing images from IoT devices such as Raspberry Pis, NVidia Jetson TX1, NanoPi Duos and others equipped with attached cameras or external USB webcams, we use Python to interface via OpenCV and PiCamera. From there we run image processing at the edge on these IoT devices using OpenCV and TensorFlow to determine attributes and image analytics. Apache MiniFi coordinates running these Python scripts and decides when and what to send from that analysis and the image to a remote Apache NiFi server for additional processing. The Apache NiFi cluster routes the images to one processing path and the JSON-encoded metadata to another flow. The JSON data (with its schema referenced from a central Schema Registry) is routed using Record Processing and SQL; this data is enriched and augmented before conversion to AVRO to be sent via Apache Kafka to SAM. Streaming Analytics Manager then does deeper processing on this stream and others, including weather and Twitter, to determine what should be done with this data. References: https://community.hortonworks.com/articles/103863/using-an-asus-tinkerboard-with-tensorflow-and-pyth.html https://community.hortonworks.com/articles/118132/minifi-capturing-converting-tensorflow-inception-t.html https://github.com/tspannhw/rpi-noir-screen https://community.hortonworks.com/articles/77988/ingest-remote-camera-images-from-raspberry-pi-via.html https://community.hortonworks.com/articles/107379/minifi-for-image-capture-and-ingestion-from-raspbe.html https://community.hortonworks.com/articles/58265/analyzing-images-in-hdf-20-using-tensorflow.html
aptivate
Python wrapper for Apache Tika, made to be easy_installed
fedelemantuano
Python bindings for Apache Tika
stumpylog
A modern Python REST client for Apache Tika server
USCDataScience
A suite of Machine Learning / Deep Learning Dockerfiles to allow Apache Tika to extract objects and to produce textual captions for images and video
chrismattmann
The Distributed Release Audit Tool (DRAT) for code analysis and verification.
dokterbob
Very early prototype of a crawler for IPFS, written in Python, using Apache's Tika and Elasticsearch for indexing.
🚴♂️⛷Data Lake, Performance tuning for text extraction from a huge amount of files.
CogStack
This is a Python FastAPI OCR service that attempts to resolve scalability and performance issues. It also relies on Tesseract OCR but without the ambiguities of the Tika framework.
shanedetsch
Gradle tasks that (1) extract the contents from a pdf file using Apache Tika, (2) parse pdf content into sentences using Apache OpenNLP, (3) parse each sentence using Python nltk, OR (4) parse each sentence using Google's SyntaxNet.
sudharsh
Python bindings for Apache Tika
phantom0301
Web crawler for rich text meta information based on Python and Tika
akarlinsky
Sample notebooks to import and manipulate PDFs using Tika
agriplace
Apache Tika Server python client
opensemanticsearch
tika-python as Debian GNU/Linux and Ubuntu Linux package
HollywoodMarks
Integrating tika-python into AWS Lambda
miodeqqq
Python implementation of using Grobid and Tika.
peterwei425
This project develops a resume fraud detection feature and a resume scoring system using NLTK, Tika and DOCX in Python
petar-popovic-bg
This package provides utility classes and static methods for Python that make use of different third party software commonly used in text processing such as: Unitex-GramLab, TreeTagger, Apache-Tika and Google-Tesseract.
Anzo52
Python PDF reader using tika
vpnry
Convert documents to txt with tika-python
PDF extraction samples comparing Azure Document Intelligence (layout model) 🏢 vs Markitdown ✍️ vs Apache Tika
serge-sotnyk
Base Docker image with Python 3.7 and Java. It can be useful for Python projects that use Java-based code such as Tika or some models in NLTK
izveigor
A web application that predicts a document's type from its content 📝
skupriienko
Python module for extracting text from URLs and PDFs
sinanercan
This project was developed to scrape data from the PDF reports added monthly to the Mortgage Finance Forecast Archives website (https://www.mba.org/news-and-research/forecasts-and-commentary/mortgage-finance-forecast-archives). The data scraping framework converts the tabular data in the PDF files to ".csv" format and integrates two tasks: web scraping and data extraction. The code uses three functions named extract_table_columns_name, extract_pdf_data, and get_pdf_files. The get_pdf_files function downloads the PDF files from the archive, and the tabular data and attribute names are then extracted using the extract_pdf_data and extract_table_columns_name functions. The framework checks the archive for newly added reports and works in a fully automated manner. The regular expressions in the framework are also compatible with the pattern formats that vary across different table structures. Two packages are used for data extraction from PDF files: pdfplumber and tika-python. At the web scraping stage, the URLs of the PDF files are parsed using the lxml parser of the BeautifulSoup package. When the code is run for the first time, it creates two files named archievefiles and archievecsvfiles, and it also creates a ".csv" file named archievelinks in the current working directory. When the code is run a second time, it downloads only the PDF files from the URLs newly added to the archive and then performs the other operations. For any further questions, contact me :
Dr0w
Tika+Python+K8s