Found 1,146 repositories (showing 30)
laurenz
pgreplay reads a PostgreSQL log file (*not* a WAL file), extracts the SQL statements and executes them in the same order and relative time against a PostgreSQL database cluster.
ftrain
CoffeeScript code to quickly extract and display all the tweetable statements from a live web page.
whoiskatrin
Python script to extract as much structured information as possible from annual/quarterly reports.
secdatabase
SECDatabase.com produced this dataset containing the text and detailed numeric information of all financial statements. The dataset is extracted from corporate annual and quarterly reports filed with the SEC using XBRL since January 2009.
rflugum
Extract the Management Discussion and Analysis (MD&A) section from 10-K financial statements
TheUpshot
A Ruby gem that extracts press releases and statements by members of Congress.
StatCan
This project uses the SLICE algorithm to extract information from a text-based PDF page containing financial statements (tabular data). It can also be used to extract regular tables, but the output will contain all text on the page.
pdfdotco
PDF.co Web API source code samples. Extract data from PDFs; parse invoices, statements, paystubs, claims, and scanned documents; split, merge, and compress PDFs; and more. Sign up today for your API key.
css-modules
A CSS Modules transform to extract export statements from local-scope classes
A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.
Hussain-Alsalman
The tasi package for R is designed to help quantitative traders extract Saudi stock market historical prices and financial statements.
tniedbala
Simple Python utility that downloads and extracts SEC financial statement data sets.
Thukyd
Python script to extract TradeRepublic transaction statements (PDF) and generate .csv overview sheets (Portfolio Performance & Investing.com).
marcocesarato
This class can parse SQL to get the query type, tables, field values, etc. It takes a string containing SQL statements and parses it to extract its different components. Currently the class can extract the SQL query method, the names of the tables involved in the query, and the field values that are passed as parameters. This parser is fairly lightweight compared to phpsqlparser and other PHP SQL parsers.
bnwlkr
Extract transaction data from RBC e-statements 💰
sidharth-panwar
Logseq plugin to extract highlights and bold statements from a block
SudheerNotes
A simple piece of software that extracts CAMS mutual fund PDF statement (India) data into a CSV file.
Zipstack
Extracting structured JSON from credit card statements using Langchain and Pydantic
Tony-Hao
A Python tool to extract and structure numeric lab test comparison statements from text
MAydogdu
Extracting sentiment from financial statements using neural networks
PBPatil
Problem Statement: given a particular PDF/text document, how can we extract keywords and arrange them in order of their weightage using Python?
qwerqy
Node.js package designed to extract and analyze import statements from TypeScript and TypeScript JSX files. It provides a simple and efficient way to scan your codebase for import declarations, making it useful for various code analysis and refactoring tasks.
arturseo-geo
Anti-hallucination research skill for Claude Code — admits uncertainty, extracts direct quotes before analysis, cites every claim, retracts unverifiable statements. Based on Anthropic's official guardrail techniques. By TheGEOLab.net
ideas4u
This project is a long-awaited one in the open-source community, where every user involved in stock trading has always wanted to develop their own software. It has been developed specifically for Indian stock market trading. It encompasses the end-to-end trading cycle for intraday trading, but the design is such that it can easily be extended to delivery trading. During the lifecycle of this project we will be using the most advanced technologies, but the base code will always be C/C++.

Development Methodology:
========================
We use the "Incremental Life Cycle Model" along with cross-platform (portable) development.

Project Priorities and Assumptions:
===================================
1) Low latency and high performance at all times.
2) Wherever a choice has to be made between memory and execution speed, we give preference to speed.
3) Every module developed will be exhaustively tested.

How the Work Proceeds:
======================
Before the beginning of any new project, we should know the PROBLEM STATEMENT, so here it is.

"Problem Statement"
-------------------
To build a high-performance, low-latency, end-to-end trading platform for the Indian stock market (but not limited to it) which home users should be able to use for intraday trading, and which guarantees a profit 99% of the time but does not guarantee a maximized profit.

First Step:
-----------
Providing the optimal solution to any problem starts with UNDERSTANDING THE PROBLEM. To understand the problem statement above, you need to extract the explicit and implicit requirements from the statement. Here is the list of requirements:

Explicit:
---------
1) High performance
2) Low latency
3) End-to-end trading platform
4) Focus on the Indian stock market, but not limited to it
5) Guarantees a profit 99% of the time, but does not guarantee a maximized profit
6) Intraday trading only

Implicit:
---------
1) Bookkeeping of orders and trades (Order Management System)
2) Availability of market data to end users on demand, for identifying stocks and placing orders
3) User account management

I may have missed something; please make suggestions, and after review we will add them here.

Second Step:
------------
To understand the explicit/implicit requirements above, you should have KNOWLEDGE OF THE VARIOUS TECHNOLOGIES and an in-depth understanding of the PROBLEM DOMAIN, i.e. the stock market. Once this is achieved, we need to architect the solution in terms of software and hardware nodes and their integration.

Third Step:
-----------
To solve the problem statement, the above requirements should be DECOMPOSED INTO MODULES and mapped to the technologies/software/hardware used. Below is the list of modules we have been able to identify:

Modules Included:
=================
Core Modules:
-------------
1) Core Libraries
2) Manual Order Entry System
3) Auto Order Entry System
4) Artificial Exchange
5) Algorithmic Trading Platform
6) Smart Order Router
7) Direct Trading Platform (optional)

Utility Modules:
----------------
8) Logger Server
9) HeartBeat Server

Technologies Used:
==================
Software:
---------
We always use freeware, open-source software, or APIs released under GPL/LGPL-style licences. Any special requirement for building/using a module will be detailed in that module's section.

For development, we generally use:
----------------------------------
Windows 7 as the operating system, though any other OS can be used; our code is platform-independent. We build with the compiler built into Visual Studio 2013, or with the Intel® compilers, which can easily be integrated into the Visual Studio IDE.

For real-time use, we generally use:
------------------------------------
SuSE Linux 10 or above with real-time extensions, gcc 4.4.1 for builds, and the vi editor.

Hardware:
---------
No special requirements for development. For real-time use, it depends on how many stocks you are interested in and on the configuration of the various modules. We generally prefer the configuration below for any number of traded stocks:
256 GB RAM, a 16-core processor, and 1 TB of HDD/SSD.

Programming Languages and Other Technologies:
---------------------------------------------
C, C++ (C++98/C++11), Lua, ZeroMQ, nanodbc, lock-free data structures, Intel TBB, Boost, Google Protobuf, MySQL, Python.

Fourth Step:
------------
Decompose each module until each entity provides a useful piece of functionality. We will explain this in each module's detailed section.

Fifth Step:
-----------
We design, develop, benchmark, unit test, and integration test the above modules.

Sixth Step:
-----------
We deploy the delivered software on the various hardware nodes as per the deployment architecture and integrate them.

Seventh Step:
-------------
We observe the behaviour of the deployed software on live traffic and cut two branches at this point: the first branch continues incremental development, and the second fixes reported issues; the second can later be merged into the first for another release.

Any suggestions for improvement are most welcome.
ultranet1
Project Description:
A music streaming company wants to introduce more automation and monitoring into their data warehouse ETL pipelines, and they have concluded that the best tool to achieve this is Apache Airflow. As their data engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output generated by this pipeline, and it is expected to run daily on a schedule, pulling new data from the source and storing the results at the destination.

Data Description:
The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline Design:
At a high level the pipeline performs the following tasks:
- Extract data from multiple S3 locations.
- Load the data into a Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[Figure: structure of the Airflow DAG]

Design Goals:
Based on the requirements of our data consumers, our pipeline is required to adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, a task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation:
Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, report generation, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies; the DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow comes with simple built-in operators like PythonOperator, BashOperator, and DummyOperator; however, Airflow also lets you extend BaseOperator and create custom operators. For this project, I developed several custom operators, described below (a minimal sketch of the pattern follows the list):

- StageToRedshiftOperator: stages data to a given Redshift cluster from a specified S3 location. The operator uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: loads data into the given fact table by running the provided SQL statement. Supports delete-insert and append-style loads.
- LoadDimensionOperator: loads data into the given dimension table by running the provided SQL statement. Supports delete-insert and append-style loads.
- SubDagOperator: groups two or more operators into one task. Here, I group the task of checking that a given table has rows with a series of data quality SQL commands.
- HasRowsOperator: a data quality check ensuring that the specified table has rows.
- DataQualityOperator: performs data quality checks by running SQL statements that validate the data.
- SongPopularityOperator: calculates the top ten most popular songs for a given interval; the interval is dictated by the DAG schedule.
- UnloadToS3Operator: stores the analysis result back to the given S3 location.
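To make the custom-operator pattern concrete, here is a minimal sketch of what a check in the style of HasRowsOperator could look like in Airflow 1.x; the constructor arguments and the PostgresHook-based implementation are illustrative assumptions, not the project's actual code (which lives in plugins/operators).

    # Minimal sketch of a custom Airflow 1.x operator in the style of the
    # HasRowsOperator described above. Constructor parameters and the
    # PostgresHook usage are assumptions for illustration.
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class HasRowsOperator(BaseOperator):
        """Data quality check: fail the task if the given table has no rows."""

        @apply_defaults
        def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
            super(HasRowsOperator, self).__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id
            self.table = table

        def execute(self, context):
            # Reuse Airflow's Postgres hook, which also speaks to Redshift.
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(self.table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} has no rows".format(self.table))
            self.log.info("Data quality check passed: %s has %s rows", self.table, records[0][0])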
Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning:
The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner we can obtain the top songs per day in an automated fashion using the pipeline. Please note that this is a trivial analysis, but you can imagine other, more complex queries that follow a similar structure.

S3 input events data:
s3://<bucket>/log_data/2018/11/
  2018-11-01-events.json
  2018-11-02-events.json
  2018-11-03-events.json
  ..
  2018-11-28-events.json
  2018-11-29-events.json
  2018-11-30-events.json

S3 output song popularity data:
s3://skuchkula-topsongs/
  songpopularity_2018-11-01
  songpopularity_2018-11-02
  songpopularity_2018-11-03
  ...
  songpopularity_2018-11-28
  songpopularity_2018-11-29
  songpopularity_2018-11-30

The DAG can be configured by giving it default_args that specify the start_date, end_date, and the other design choices mentioned above (a usage sketch follows this entry):

default_args = {
    'owner': 'shravan',
    'start_date': datetime(2018, 11, 1),
    'end_date': datetime(2018, 11, 30),
    'depends_on_past': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'catchup_by_default': False,
    'provide_context': True,
}

How to run this project?

Step 1: Create an AWS Redshift cluster, either through the console or through the notebook provided in create-redshift-cluster. Run the notebook to create the cluster and make a note of:
DWH_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml, and ensure that you have mapped the volume to point to the location of your DAGs. NOTE: details on how to manage Apache Airflow on a Mac are here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

Step 3: Configure the Apache Airflow hooks. For the S3 connection, the login and password are the access key and secret key of the IAM user you created; these credentials allow Airflow to read data from S3. For the Redshift connection, the values can easily be gathered from your Redshift cluster's connection details.

Step 4: Execute the create-tables-dag. This DAG creates the staging, fact, and dimension tables. It is triggered manually because we want to keep table creation out of the main DAG; normally table creation can be handled by simply triggering a script, but for the sake of illustration I created a DAG for it and had Airflow trigger it. You can turn the DAG off once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with a @daily schedule interval and the execution end date is 2018-11-30, Airflow will automatically trigger and schedule the DAG runs once per day, 30 times in total, from start_date through end_date.
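As a usage sketch for the configuration above: default_args plugs into the DAG constructor, while catchup is read as a DAG-level argument in Airflow (the 'catchup_by_default' key in the entry's dict has no effect there). The exact wiring below, including the DAG id and schedule, is inferred from the entry's description rather than taken from the project's code.

    # Minimal sketch of wiring the default_args above into a DAG.
    # DAG id and schedule_interval are inferred from the description;
    # catchup is passed to the DAG constructor, where Airflow reads it.
    from datetime import datetime, timedelta
    from airflow import DAG

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,             # no dependencies on past runs
        'email_on_retry': False,              # do not email on retry
        'retries': 3,                         # retry a failed task 3 times
        'retry_delay': timedelta(minutes=5),  # retries happen every 5 minutes
    }

    dag = DAG(
        'load_and_transform_data_in_redshift',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False,                        # catchup turned off
    )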
Extract useful insights from PDF bank statements (Indian banks) using Python automation
digicademy
A generic webservice to extract RDF statements from XML resources
Extracts information from an HDFC Bank Credit Card Statement (India)
eric-wieser
Python scrapers for extracting bank statements from Tesco, Santander, and Lloyds in the QIF format