Found 566 repositories(showing 30)
mehmet-kozan
Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run 🤗 directly in your browser or in Node.js
JuniverseCoder
This tool converts PDF files and images into editable PowerPoint presentations (`.pptx`) by leveraging structured data from the MinerU PDF Extractor. It accurately reconstructs text, images, and layout, providing a high-fidelity, editable version of the original document.
nainiayoub
PDF text data extraction web app with OCR for scanned documents
StatCan
This project uses SLICE algorithm to extract information from a text-based PDF page containing financial statements (tabular data). It can also be used to extract regular tables but will contain all text on a page.
mayaramyadav
Laravel OCR & Document Data Extractor – A powerful OCR and document parsing engine for Laravel. It provides intelligent text extraction, structured data parsing, and AI-powered cleanup for documents like invoices, receipts, and PDFs.
shine-jayakumar
Batch-convert pdf to text, extract data from pdf in python
amitgupta4407
This is a complete website in which you can chat with pdf, extract meta data, text, links, image, and lot more . Check my blog for more details: https://medium.com/@amit.2503719/allaboutpdf-tool-for-data-extraction-and-talking-to-pdf-using-chatpdf-feature-f2daea15a59c
umairalipathan1980
A document data extraction system that automatically generates Pydantic schemas from natural language requirements and extracts structured data from PDFs and text documents.
jina-ai
Jina Executor used for extracting images and text as chunks from PDF data
shivangmauryaa
FileFetcher is a Python tool that extracts and filters URLs from archived Wayback Machine data based on file types like .pdf, .zip, .sql, and more. It checks the availability of each URL, saving valid ones with a 200 OK response to a text file, ideal for research or web scraping.
chen1tian
extract pdf table data using camelot, use ocr extract text from image-base pages
aws-solutions-library-samples
This GenAI solution enables users to extract insights from diverse data formats (video, audio, PDFs, text) through a unified interface. Using AWS AI services, it processes and indexes content at scale, provides quick summaries, and delivers contextualized answers with direct links to source materials.
michaelmml
Automated PDF and text processing with Spacy and NLTK; information extraction from text based on grammatical structure; deployed on extracted raw search data
Popstar225
No description available
MistralOCR is an open-source application that transforms documents into structured data using Mistral AI's OCR capabilities. Built with FastAPI and Streamlit, it provides an intuitive interface for extracting and processing text from PDFs and images, making document digitization effortless and accurate.
mixelpixx
Process PDF to extract Text as structed JSON, Tables and PNG of Tables, and XLSX of data in Tables
pettarin
A Python script to extract the plain-text data from the NVdB PDF file.
Extracted tabular data from scanned images or PDF documents and restructured text into well-formatted tables exported to CSV/Excel files.
MediaWiki:FileUploadWizard.js From Wikipedia, the free encyclopedia Jump to navigationJump to search Note: After saving, you have to bypass your browser's cache to see the changes. Internet Explorer: hold down the Ctrl key and click the Refresh or Reload button. Firefox: hold down the Shift key while clicking Reload (or press Ctrl-Shift-R). Google Chrome and Safari users can just click the Reload button. For details and instructions about other browsers, see Wikipedia:Bypass your cache. /* * =============================================================== * FileUploadWizard.js * Script for uploading files through a dynamic questionnaire. * This is the code to accompany [[Wikipedia:File Upload Wizard]]. * =============================================================== */ var fuwTesting = false; var fuwDefaultTextboxLength = 60; var fuwDefaultTextareaWidth = '90%'; var fuwDefaultTextareaLines = 3; // ================================================================ // Constructor function of global fuw (= File Upload Wizard) object // ================================================================ function fuwGlobal() { // Loading the accompanying .css mw.loader.load( mw.config.get('wgServer') + mw.config.get('wgScriptPath') + '/index.php?title=MediaWiki:FileUploadWizard.css&action=raw&ctype=text/css', 'text/css' ); // see if user is logged in, autoconfirmed, experienced etc. this.getUserStatus(); fuwSetVisible('warningLoggedOut', (this.userStatus == 'anon')); fuwSetVisible('warningNotConfirmed', (this.userStatus == 'notAutoconfirmed')); if ((this.userStatus == 'anon') || (this.userStatus == 'notAutoconfirmed')) { return; } fuwSetVisible('fuwStartScriptLink', false); // create the form element to wrap the main ScriptForm area // containing input elements of Step2 and Step3 var frm = fuwGet('fuwScriptForm'); if (! frm) { frm = document.createElement('form'); frm.id = "fuwScriptForm"; var area = fuwGet('placeholderScriptForm'); var parent = area.parentNode; parent.insertBefore(frm, area); parent.removeChild(area); frm.appendChild(area); } this.ScriptForm = frm; // create the TargetForm element that contains the filename // input box, together with hidden input controls. // This is the form that is actually submitted to the api.php. frm = fuwGet('TargetForm'); if (! frm) { frm = document.createElement('form'); frm.id = "TargetForm"; frm.method = "post"; frm.enctype = "multipart/form-data"; // "enctype" doesn't work properly on IE; need "encoding" instead: frm.encoding = "multipart/form-data"; // we'll submit via api.php, not index.php, mainly because that // allows us to use a proper edit summary different from the page content frm.action = mw.config.get('wgServer') + mw.config.get('wgScriptPath') + '/api.php'; // However, since api.php sends back a response page that humans won't want to read, // we'll have to channel that response away and discard it. We'll use a hidden iframe // for that purpose. // Unfortunately, it doesn't seem possible to submit file upload content through an // Xmlhtml object via Ajax. frm.target = "TargetIFrame"; //testing: //frm.target = "_blank"; var area = fuwGet('placeholderTargetForm'); var parent = area.parentNode; parent.insertBefore(frm, area); parent.removeChild(area); frm.appendChild(area); } this.TargetForm = frm; // For the testing version, create a third form that will display // the contents to be submitted, at the bottom of the page if (fuwTesting) { frm = fuwGet('fuwTestForm'); if (! frm) { frm = document.createElement('form'); frm.id = "fuwTestForm"; var area = fuwGet('placeholderTestForm'); var parent = area.parentNode; parent.insertBefore(frm, area); parent.removeChild(area); frm.appendChild(area); } this.TestForm = frm; } // objects to hold cached results during validation and processing this.opts = { }; this.warn = { }; // create the input filename box var filebox = document.createElement('input'); filebox.id = 'file'; filebox.name = 'file'; filebox.type = 'file'; filebox.size = fuwDefaultTextboxLength; filebox.onchange = fuwValidateFile; filebox.accept = 'image/png,image/jpeg,image/gif,image/svg+xml,image/tiff,image/x-xcf,application/pdf,image/vnd.djvu,audio/ogg,video/ogg,audio/rtp-midi'; fuwAppendInput('file', filebox); // create hidden controls for sending the remaining API parameters: fuwMakeHiddenfield('action', 'upload', 'apiAction'); fuwMakeHiddenfield('format', 'xml', 'apiFormat'); fuwMakeHiddenfield('filename', '', 'apiFilename'); fuwMakeHiddenfield('text', '', 'apiText'); fuwMakeHiddenfield('comment', '', 'apiComment'); fuwMakeHiddenfield('token', mw.user.tokens.get('editToken'), 'apiToken'); fuwMakeHiddenfield('ignorewarnings', 1, 'apiIgnorewarnings'); fuwMakeHiddenfield('watch', 1, 'apiWatch'); if (fuwTesting) { fuwMakeHiddenfield('title', mw.config.get('wgPageName') + "/sandbox", 'SandboxTitle'); fuwMakeHiddenfield('token', mw.user.tokens.get('editToken'), 'SandboxToken'); fuwMakeHiddenfield('recreate', 1, 'SandboxRecreate'); } // create a hidden IFrame to send the api.php response to var ifr = document.createElement('iframe'); ifr.id = "TargetIFrame"; ifr.name = "TargetIFrame"; //ifr.setAttribute('style', 'float:right;width:150px;height:150px;'); ifr.style.display = "none"; ifr.src = ""; fuwAppendInput('TargetIFrame', ifr); if (fuwTesting) { // create the sandbox submit button btn = document.createElement('input'); btn.id = 'SandboxButton'; btn.value = 'Sandbox'; btn.name = 'Sandbox'; btn.disabled = true; btn.type = 'button'; btn.style.width = '12em'; btn.onclick = fuwSubmitSandbox; fuwAppendInput('SandboxButton', btn); } // create the real submit button btn = document.createElement('input'); btn.id = "SubmitButton"; btn.value = "Upload"; btn.name = "Upload"; btn.disabled = true; btn.type = "button"; btn.onclick = fuwSubmitUpload; btn.style.width = '12em'; fuwAppendInput('SubmitButton', btn); // create the Commons submit button btn = document.createElement('input'); btn.id = "CommonsButton"; btn.value = "Upload on Commons"; btn.name = "Upload_on_Commons"; btn.disabled = true; btn.type = "button"; btn.onclick = fuwSubmitCommons; btn.style.width = '12em'; fuwAppendInput('CommonsButton', btn); // create reset buttons for (i = 1; i<=2; i++) { btn = document.createElement('input'); btn.id = 'ResetButton' + i; btn.value = "Reset form"; btn.name = "Reset form"; btn.type = "button"; btn.onclick = fuwReset; btn.style.width = '12em'; fuwAppendInput('ResetButton' + i, btn); } // names of radio button fields var optionRadioButtons = { // top-level copyright status choice 'FreeOrNonFree' : ['OptionFree','OptionNonFree','OptionNoGood'], // main subsections under OptionFree 'FreeOptions' : ['OptionOwnWork', 'OptionThirdParty', 'OptionFreeWebsite', 'OptionPDOld', 'OptionPDOther'], // main subsections under OptionNonFree 'NonFreeOptions': ['OptionNFSubject','OptionNF3D','OptionNFExcerpt', 'OptionNFCover','OptionNFLogo','OptionNFPortrait', 'OptionNFMisc'], // response options inside warningFileExists 'FileExistsOptions': ['NoOverwrite','OverwriteSame','OverwriteDifferent'], // choice of evidence in OptionThirdParty subsection 'ThirdPartyEvidenceOptions' : ['ThirdPartyEvidenceOptionLink', 'ThirdPartyEvidenceOptionOTRS', 'ThirdPartyEvidenceOptionOTRSForthcoming', 'ThirdPartyEvidenceOptionNone'], // choice of PD status in OptionPDOld subsection 'PDOldOptions' : ['PDUS1923','PDURAA','PDFormality','PDOldOther'], // choice of PD status in OptionPDOther subsection 'PDOtherOptions': ['PDOtherUSGov','PDOtherOfficial','PDOtherSimple', 'PDOtherOther'], // whether target article is wholly or only partly dedicated to discussing non-free work: 'NFSubjectCheck': ['NFSubjectCheckDedicated','NFSubjectCheckDiscussed'], 'NF3DCheck' : ['NF3DCheckDedicated','NF3DCheckDiscussed'], // choice about copyright status of photograph in OptionNF3D 'NF3DOptions' : ['NF3DOptionFree','NF3DOptionSame'] }; for (var group in optionRadioButtons) { var op = optionRadioButtons[group]; for (i=0; i<op.length; i++) { fuwMakeRadiobutton(group, op[i]); } } this.ScriptForm.NoOverwrite.checked = true; // input fields that trigger special // onchange() event handlers for validation: fuwMakeTextfield('InputName', fuwValidateFilename); fuwMakeTextfield('NFArticle', fuwValidateNFArticle); // names of input fields that trigger normal // validation event handler var activeTextfields = [ 'Artist3D','Country3D', 'Date','OwnWorkCreation','OwnWorkPublication', 'Author','Source', 'Permission','ThirdPartyOtherLicense', 'ThirdPartyEvidenceLink','ThirdPartyOTRSTicket', 'FreeWebsiteOtherLicense', 'PDOldAuthorLifetime','Publication', 'PDOldCountry','PDOldPermission', 'PDOfficialPermission','PDOtherPermission', 'NFSubjectPurpose', 'NF3DOrigDate', 'NF3DPurpose', 'NF3DCreator', 'NFPortraitDeceased', 'EditSummary' ]; for (i=0; i<activeTextfields.length; i++) { fuwMakeTextfield(activeTextfields[i]); } // names of multiline textareas var activeTextareas = [ 'InputDesc','NF3DPermission', 'NFCommercial','NFPurpose','NFReplaceableText', 'NFReplaceable','NFCommercial','NFMinimality','AnyOther' ]; for (i=0; i<activeTextareas.length; i++) { fuwMakeTextarea(activeTextareas[i]); }; var checkboxes = [ 'NFCoverCheckDedicated','NFLogoCheckDedicated','NFPortraitCheckDedicated' ]; for (i=0; i<checkboxes.length; i++) { fuwMakeCheckbox(checkboxes[i]); }; var licenseLists = { 'OwnWorkLicense' : // array structure as expected for input to fuwMakeSelection() function. // any entry that is a two-element array will be turned into an option // (first element is the value, second element is the display string). // Entries that are one-element arrays will be the label of an option group. // Zero-element arrays mark the end of an option group. [ ['Allow all use as long as others credit you and share it under similar conditions'], ['self|GFDL|cc-by-sa-4.0|migration=redundant', 'Creative Commons Attribution-Share Alike 4.0 + GFDL (recommended)', true], ['self|cc-by-sa-4.0', 'Creative Commons Attribution-Share Alike 4.0'], [], ['Allow all use as long as others credit you'], ['self|cc-by-4.0', 'Creative Commons Attribution 4.0'], [], ['Reserve no rights'], ['cc-zero', 'CC-zero Universal Public Domain Dedication'], [] ], 'ThirdPartyLicense' : [ ['', 'please select the correct license...'], ['Freely licensed:'], ['cc-by-sa-4.0', 'Creative Commons Attribution-Share Alike (cc-by-sa-4.0)'], ['cc-by-4.0', 'Creative Commons Attribution (cc-by-4.0)'], ['GFDL', 'GNU Free Documentation License (GFDL)'], [], ['No rights reserved:'], ['PD-author', 'Public domain'], [], ['Other (see below)'], [] ], 'FreeWebsiteLicense' : [ ['', 'please select the correct license...'], ['Freely licensed:'], ['cc-by-sa-4.0', 'Creative Commons Attribution-Share Alike (cc-by-sa-4.0)'], ['cc-by-4.0', 'Creative Commons Attribution (cc-by-4.0)'], ['GFDL', 'GNU Free Documentation License (GFDL)'], [], ['No rights reserved:'], ['PD-author', 'Public domain'], [], ['Other (see below)'], [] ], 'USGovLicense' : [ ['PD-USGov', 'US Federal Government'], ['PD-USGov-NASA','NASA'], ['PD-USGov-Military-Navy','US Navy'], ['PD-USGov-NOAA','US National Oceanic and Atmospheric Administration'], ['PD-USGov-Military-Air_Force','US Air Force'], ['PD-USGov-Military-Army','US Army'], ['PD-USGov-CIA-WF','CIA World Factbook'], ['PD-USGov-USGS','United States Geological Survey'] ], 'IneligibleLicense' : [ ['', 'please select one...'], ['PD-shape','Item consists solely of simple geometric shapes'], ['PD-text','Item consists solely of a few individual words or letters'], ['PD-textlogo','Logo or similar item consisting solely of letters and simple geometric shapes'], ['PD-chem','Chemical structural formula'], ['PD-ineligible','Other kind of item that contains no original authorship'] ], 'NFSubjectLicense' : [ ['', 'please select one...'], ['Non-free 2D art', '2-dimensional artwork (painting, drawing etc.)'], ['Non-free historic image', 'Unique historic photograph'], ['Non-free fair use in', 'something else (please describe in description field on top)'] ], 'NF3DLicense' : [ ['', 'please select one...'], ['Non-free architectural work', 'Architectural work'], ['Non-free 3D art', 'Other 3-dimensional creative work (sculpture etc.)'] ], 'NFCoverLicense' : [ ['', 'please select one...'], ['Non-free book cover', 'Cover page of a book'], ['Non-free album cover', 'Cover of a sound recording (album, single, song, CD)'], ['Non-free game cover', 'Cover of a video/computer game'], ['Non-free magazine cover', 'Cover page of a magazine'], ['Non-free video cover', 'Cover of a video'], ['Non-free software cover', 'Cover of a software product'], ['Non-free product cover', 'Cover of some commercial product'], ['Non-free title-card', 'Title screen of a TV programme'], ['Non-free movie poster', 'Movie poster'], ['Non-free poster', 'Official poster of an event'], ['Non-free fair use in', 'something else (please describe in description field on top)'] ], 'NFExcerptLicense' : [ ['', 'please select one...'], ['Non-free television screenshot', 'Television screenshot'], ['Non-free film screenshot', 'Movie screenshot'], ['Non-free game screenshot', 'Game screenshot'], ['Non-free video screenshot', 'Video screenshot'], ['Non-free music video screenshot', 'Music video screenshot'], ['Non-free software screenshot', 'Software screenshot'], ['Non-free web screenshot', 'Website screenshot'], ['Non-free speech', 'Audio excerpt from a speech'], ['Non-free audio sample', 'Sound sample of an audio recording'], ['Non-free video sample', 'Sample extract from a video'], ['Non-free sheet music', 'Sheet music representing a musical piece'], ['Non-free comic', 'Panel from a comic, graphic novel, manga etc.'], ['Non-free computer icon', 'Computer icon'], ['Non-free newspaper image', 'Page from a newspaper'], ['Non-free fair use in', 'something else (please describe in description field on top)'] ], 'NFLogoLicense' : [ ['Non-free logo', 'Logo of a company, organization etc.'], ['Non-free seal', 'Official seal, coat of arms etc'], ['Non-free symbol', 'Other official symbol'] ], 'NFMiscLicense' : [ ['Non-free fair use in', 'something else (please describe in description field on top)'], ['Non-free historic image', 'Historic photograph'], ['Non-free 2D art', '2-dimensional artwork (painting, drawing etc.)'], ['Non-free currency', 'Depiction of currency (banknotes, coins etc.)'], ['Non-free architectural work', 'Architectural work'], ['Non-free 3D art', 'Other 3-dimensional creative work (sculpture etc.)'], ['Non-free book cover', 'Cover page of a book'], ['Non-free album cover', 'Cover of a sound recording(album, single, song, CD)'], ['Non-free game cover', 'Cover of a video/computer game'], ['Non-free magazine cover', 'Cover page of a magazine'], ['Non-free video cover', 'Cover of a video'], ['Non-free software cover', 'Cover of a software product'], ['Non-free product cover', 'Cover of some commercial product'], ['Non-free title-card', 'Title screen of a TV programme'], ['Non-free movie poster', 'Movie poster'], ['Non-free poster', 'Official poster of an event'], ['Non-free television screenshot', 'Television screenshot'], ['Non-free film screenshot', 'Movie screenshot'], ['Non-free game screenshot', 'Game screenshot'], ['Non-free video screenshot', 'Video screenshot'], ['Non-free music video screenshot', 'Music video screenshot'], ['Non-free software screenshot', 'Software screenshot'], ['Non-free web screenshot', 'Website screenshot'], ['Non-free speech', 'Audio excerpt from a speech'], ['Non-free audio sample', 'Sound sample of an audio recording'], ['Non-free video sample', 'Sample extract from a video'], ['Non-free sheet music', 'Sheet music representing a musical piece'], ['Non-free comic', 'Panel from a comic, graphic novel, manga etc.'], ['Non-free computer icon', 'Computer icon'], ['Non-free newspaper image', 'Page from a newspaper'], ['Non-free logo', 'Logo of a company, organization etc.'], ['Non-free seal', 'Official seal, coat of arms etc'], ['Non-free symbol', 'Other official symbol'], ['Non-free sports uniform', 'Sports uniform'], ['Non-free stamp', 'Stamp'] ], 'NFExtraLicense' : [ ['', 'none'], ['Crown copyright and other governmental sources'], ['Non-free Crown copyright', 'UK Crown Copyright'], ['Non-free New Zealand Crown Copyright', 'NZ Crown Copyright'], ['Non-free Canadian Crown Copyright', 'Canadian Crown Copyright'], ['Non-free AUSPIC', 'AUSPIC (Australian Parliament image database)'], ['Non-free Philippines government', 'Philippines government'], ['Non-free Finnish Defence Forces', 'Finnish Defence Forces'], [], ['Other individual sources'], ['Non-free Denver Public Library image', 'Denver Public Library'], ['Non-free ESA media', 'ESA (European Space Agency)'], [], ['Possibly public domain in other countries'], ['Non-free Old-50', 'Author died more than 50 years ago.'], ['Non-free Old-70', 'Author died more than 70 years ago.'], [], ['Some permissions granted, but not completely free'], ['Non-free promotional', 'From promotional press kit'], ['Non-free with NC', 'Permission granted, but only for educational and/or non-commercial purposes'], ['Non-free with ND', 'Permission granted, but no derivative works allowed'], ['Non-free with permission', 'Permission granted, but only for Wikipedia'], [] ] }; for (var group in licenseLists) { fuwMakeSelection(group, licenseLists[group]); } this.knownCommonsLicenses = { 'self|GFDL|cc-by-sa-all|migration=redundant' : 1, 'self|Cc-zero' : 1, 'PD-self' : 1, 'self|GFDL|cc-by-sa-4.0|migration=redundant' : 1, 'self|GFDL|cc-by-4.0|migration=redundant' : 1, 'self|GFDL|cc-by-sa-3.0|migration=redundant' : 1, 'self|GFDL|cc-by-3.0|migration=redundant' : 1, 'self|cc-by-sa-4.0' : 1, 'self|cc-by-sa-3.0' : 1, 'cc-by-sa-4.0' : 1, 'cc-by-sa-3.0' : 1, 'cc-by-sa-2.5' : 1, 'cc-by-4.0' : 1, 'cc-by-3.0' : 1, 'cc-by-2.5' : 1, 'FAL' : 1, 'PD-old-100' : 1, 'PD-old' : 1, 'PD-Art' : 1, 'PD-US' : 1, 'PD-USGov' : 1, 'PD-USGov-NASA' : 1, 'PD-USGov-Military-Navy' : 1, 'PD-ineligible' : 1, 'Attribution' : 1, 'Copyrighted free use' : 1 }; // textfields that don't react directly // to user input and are used only for assembling stuff: if (fuwTesting) { fuwMakeTextfield('SandboxSummary', function(){void(0);}); fuwMakeTextarea('SandboxText', function(){void(0);}); fuwGet('SandboxSummary').disabled="disabled"; fuwGet('SandboxText').disabled="disabled"; fuwGet('SandboxText').rows = 12; } // set links to "_blank" target, so we don't accidentally leave the page, // because on some browsers that would destroy all the input the user has already entered $('.fuwOutLink a').each(function() { this.target = '_blank'; }); // make main area visible fuwSetVisible('UploadScriptArea', true); } // ====================================== // end of fuwGlobal constructor function // ====================================== function fuwRadioClick(e) { var ev = e || event; var src = ev.target || ev.srcElement; //alert('onclick event from ' + src + ' (' + src.value + ')'); fuwUpdateOptions(); return true; } /* * ============================================================= * function fuwUpdateOptions * ============================================================= * This is the onchange event handler for most of the input * elements in the main form. It changes visibility and disabled * status for the various sections of the input form in response * to which options are chosen. */ function fuwUpdateOptions() { var fuw = window.fuw; var warn = fuw.warn; var opts = fuw.opts = { }; opts.InputFilename = $('#TargetForm input#file').val(); var widgets = fuw.ScriptForm.elements; for (i = 0; i < widgets.length; i++) { var w = widgets[i]; if (w.type == "radio") { var nm = w.name; var id = w.id; var vl = w.checked && !w.disabled && fuwIsVisible(w); opts[id] = vl; if (vl) opts[nm] = id; } else { var id = w.id; var active = !w.disabled && fuwIsVisible(w); if (active) { var value = ((type == 'checkbox') ? w.checked : w.value); opts[id] = value; } } }; opts.MainOption = opts.FreeOptions || opts.NonFreeOptions; // some parts of the input form are re-used across sections // and must be moved into the currently active input section: // minimality section is shared between all NF sections fuwMove('NFMinimalitySection', 'detailsNFSubject', (opts.OptionNFSubject)) || fuwMove('NFMinimalitySection', 'detailsNF3D', (opts.OptionNF3D)) || fuwMove('NFMinimalitySection', 'detailsNFExcerpt', (opts.OptionNFExcerpt)) || fuwMove('NFMinimalitySection', 'detailsNFCover', (opts.OptionNFCover)) || fuwMove('NFMinimalitySection', 'detailsNFLogo', (opts.OptionNFLogo)) || fuwMove('NFMinimalitySection', 'detailsNFPortrait', (opts.OptionNFPortrait)) || fuwMove('NFMinimalitySection', 'detailsNFMisc', true); // AnyOtherInfo section is shared between all fuwMove('AnyOtherInfo', 'detailsOwnWork', opts.OptionOwnWork) || fuwMove('AnyOtherInfo', 'detailsThirdParty', opts.OptionThirdParty) || fuwMove('AnyOtherInfo', 'detailsFreeWebsite', opts.OptionFreeWebsite) || fuwMove('AnyOtherInfo', 'detailsPDOld', opts.OptionPDOld) || fuwMove('AnyOtherInfo', 'detailsPDOther', opts.OptionPDOther) || fuwMove('AnyOtherInfo', 'detailsNFSubject', opts.OptionNFSubject) || fuwMove('AnyOtherInfo', 'detailsNF3D', opts.OptionNF3D) || fuwMove('AnyOtherInfo', 'detailsNFExcerpt', opts.OptionNFExcerpt) || fuwMove('AnyOtherInfo', 'detailsNFCover', opts.OptionNFCover) || fuwMove('AnyOtherInfo', 'detailsNFLogo', opts.OptionNFLogo) || fuwMove('AnyOtherInfo', 'detailsNFPortrait', opts.OptionNFPortrait) || fuwMove('AnyOtherInfo', 'detailsNFMisc', opts.OptionNFMisc); // author input field is shared between all sections except "Own Work". // (will serve for the immediate/photographic author, in those cases where there // are two author fields) fuwMove('Author', 'placeholderFreeWebsiteAuthor', (opts.OptionFreeWebsite)) || fuwMove('Author', 'placeholderPDOldAuthor', (opts.OptionPDOld)) || fuwMove('Author', 'placeholderPDOtherAuthor', (opts.OptionPDOther)) || fuwMove('Author', 'placeholderNFSubjectAuthor', (opts.OptionNFSubject)) || fuwMove('Author', 'placeholderNF3DAuthor', (opts.OptionNF3D)) || fuwMove('Author', 'placeholderNFExcerptAuthor', (opts.OptionNFExcerpt)) || fuwMove('Author', 'placeholderNFCoverAuthor', (opts.OptionNFCover)) || fuwMove('Author', 'placeholderNFPortraitAuthor', (opts.OptionNFPortrait)) || fuwMove('Author', 'placeholderNFMiscAuthor', (opts.OptionNFMisc)) || fuwMove('Author', 'placeholderAuthor', true); // source input field is shared between all sections except "Own Work". // (will serve for immediate/web source, in those cases where there are two // source fields involved) fuwMove('Source', 'placeholderFreeWebsiteSource', (opts.OptionFreeWebsite)) || fuwMove('Source', 'placeholderPDOldSource', (opts.OptionPDOld)) || fuwMove('Source', 'placeholderPDOtherSource', (opts.OptionPDOther)) || fuwMove('Source', 'placeholderNFSubjectSource', (opts.OptionNFSubject)) || fuwMove('Source', 'placeholderNF3DSource', (opts.OptionNF3D)) || fuwMove('Source', 'placeholderNFExcerptSource', (opts.OptionNFExcerpt)) || fuwMove('Source', 'placeholderNFCoverSource', (opts.OptionNFCover)) || fuwMove('Source', 'placeholderNFLogoSource', (opts.OptionNFLogo)) || fuwMove('Source', 'placeholderNFPortraitSource', (opts.OptionNFPortrait)) || fuwMove('Source', 'placeholderNFMiscSource', (opts.OptionNFMisc)) || fuwMove('Source', 'placeholderSource', true); // date input field is shared between all sections except "Logo", which doesn't need it. // will serve for derived/photographic date in the case of 3D items fuwMove('Date', 'placeholderFreeWebsiteDate', (opts.OptionFreeWebsite)) || fuwMove('Date', 'placeholderThirdPartyDate', (opts.OptionThirdParty)) || fuwMove('Date', 'placeholderPDOldDate', (opts.OptionPDOld)) || fuwMove('Date', 'placeholderPDOtherDate', (opts.OptionPDOther)) || fuwMove('Date', 'placeholderNFSubjectDate', (opts.OptionNFSubject)) || fuwMove('Date', 'placeholderNF3DDate', (opts.OptionNF3D)) || fuwMove('Date', 'placeholderNFExcerptDate', (opts.OptionNFExcerpt)) || fuwMove('Date', 'placeholderNFCoverDate', (opts.OptionNFCover)) || fuwMove('Date', 'placeholderNFPortraitDate', (opts.OptionNFPortrait)) || fuwMove('Date', 'placeholderNFMiscDate', (opts.OptionNFMisc)) || fuwMove('Date', 'placeholderDate', true); // permission field is shared between ThirdParty and FreeWebsite sections fuwMove('Permission', 'placeholderFreeWebsitePermission', (opts.OptionFreeWebsite)) || fuwMove('Permission', 'placeholderPermission', true); // publication field is shared between PDOld, NFPortrait and NFMisc fuwMove('Publication', 'placeholderNFPortraitPublication', (opts.OptionNFPortrait)) || fuwMove('Publication', 'placeholderNFMiscPublication', (opts.OptionNFMisc)) || fuwMove('Publication', 'placeholderPublication', true); // Purpose, Commercial, Replaceable and ReplaceableText FUR fields are shared // between some but not all of the non-free sections fuwMove('NFPurpose', 'placeholderNFExcerptPurpose', (opts.OptionNFExcerpt)) || fuwMove('NFPurpose', 'placeholderNFPurpose'); fuwMove('NFCommercial', 'placeholderNFPortraitCommercial', (opts.OptionNFPortrait)) || fuwMove('NFCommercial', 'placeholderNFCommercial'); fuwMove('NFReplaceable', 'placeholderNFPortraitReplaceable', (opts.OptionNFPortrait)) || fuwMove('NFReplaceable', 'placeholderNFReplaceable'); fuwMove('NFReplaceableText', 'placeholderNFExcerptReplaceable', (opts.OptionNFExcerpt)) || fuwMove('NFReplaceableText', 'placeholderNFReplaceableText', true); // submit button goes to Step1 if user has chosen a plain overwrite of an existing file, // and to the active section of Step3 if otherwise fuwMove('fuwSubmit', 'UploadScriptStep1', (warn.ImageExists && opts.OverwriteSame)) || fuwMove('fuwSubmit', 'detailsOwnWork', opts.OptionOwnWork) || fuwMove('fuwSubmit', 'detailsThirdParty', opts.OptionThirdParty) || fuwMove('fuwSubmit', 'detailsFreeWebsite', opts.OptionFreeWebsite) || fuwMove('fuwSubmit', 'detailsPDOld', opts.OptionPDOld) || fuwMove('fuwSubmit', 'detailsPDOther', opts.OptionPDOther) || fuwMove('fuwSubmit', 'detailsNFSubject', opts.OptionNFSubject) || fuwMove('fuwSubmit', 'detailsNF3D', opts.OptionNF3D) || fuwMove('fuwSubmit', 'detailsNFExcerpt', opts.OptionNFExcerpt) || fuwMove('fuwSubmit', 'detailsNFCover', opts.OptionNFCover) || fuwMove('fuwSubmit', 'detailsNFLogo', opts.OptionNFLogo) || fuwMove('fuwSubmit', 'detailsNFPortrait', opts.OptionNFPortrait) || fuwMove('fuwSubmit', 'fuwSubmitHost', true); // Show and hide warnings: // filename-related warnings: fuwSetVisible('warningIllegalChars', warn.IllegalChars); fuwSetVisible('warningBadFilename', warn.BadFilename); fuwSetVisible('warningImageOnCommons', warn.ImageOnCommons); fuwSetVisible('warningImageExists', warn.ImageExists); fuwMove('warningImageThumb', 'warningImageOnCommons', warn.ImageOnCommons, true) || fuwMove('warningImageThumb', 'warningImageExists', true, true); // notices related to the top-level options: fuwSetVisible('warningWhyNotCommons', opts.OptionFree); fuwSetVisible('warningNF', opts.OptionNonFree); fuwSetVisible('warningNoGood', opts.OptionNoGood); // warnings related to non-free "used in" article fuwSetVisible('warningNFArticleNotFound', warn.NFArticleNotFound); fuwSetVisible('warningNFArticleNotMainspace', warn.NFArticleNotMainspace); fuwSetVisible('warningUserspaceDraft', warn.UserspaceDraft); fuwSetVisible('warningNFArticleDab', warn.NFArticleDab); fuwSetVisible('NFArticleOK', warn.NFArticleOK); // warnings depending on user status: if (fuw.userStatus.match(/problem|newbie|notAutoconfirmed/)) { fuwSetVisible('warningFreeWebsite', opts.OptionFreeWebsite); fuwSetVisible('warningOwnWork', opts.OptionOwnWork); fuwSetVisible('warningPDOther', opts.OptionPDOther); fuwSetVisible('warningNFSubject', opts.OptionNFSubject); }
pyr0mind
AI-PDF-Agent converts PDFs into queryable data using AI. Extracts text (PyMuPDF), generates embeddings, and answers questions via LLMs (OpenAI). Modular for custom vector databases (Pinecone/FAISS). Ideal for research/legal analysis. MIT-licensed. Clone, install dependencies, and query documents. Open-source contributions enhance preprocessing.
atlasunified
This project is designed to automate the process of downloading and processing large datasets from the web: specifically, it scrapes and downloads .snappy.parquet files, converts them to CSV, extracts URLs, downloads associated PDFs, performs OCR on the PDFs to extract text and bounding boxes, and finally organizes and archives the data.
allanninal
The AI Resume Analyzer is a powerful tool that leverages Hugging Face’s dslim/bert-base-NER model to extract key details like skills, certifications, and organizations from resumes. It supports PDF and text files, processes data in real-time, and presents structured, easy-to-review insights for job seekers and recruiters.
Open-Source-Legal
Collection of tools to provide text extract outputs for all PDFs that include x,y coordinate data as well as text
shaadclt
This repository demonstrates how to extract text, images, and structured content from PDF documents using pymupdf4llm in Google Colab. It also includes data preparation for LlamaIndex for further document analysis and information extraction.
anjesh
Processes the pdf and finds whether it's text based or not and extracts data from the pdf in either case
CephasTechOrg
AI PDF Analyzer is a web application that allows users to upload PDF documents and instantly analyze their content using AI. The system processes PDFs, extracts text and data, and provides intelligent insights, summaries, and answers to user queries.
SyedAliHaider02
Lexa is a Flask-based web application designed to perform topic modeling on text data extracted from PDF files or input directly via a text box. Using advanced natural language processing (NLP) techniques, Lexa leverages Latent Dirichlet Allocation (LDA) to uncover hidden topics within the provided text.
EmmanuelLwele
Interview Coding Challenge Data Science Step 1 of the Data Scientist Interview process. Follow the instructions below to complete this portion of the interview. Please note, although we do not set a time limit for this challenge, we recommend completing it as soon as possible as we evaluate candidates on a first come, first serve basis... If you have any questions, please feel free to email support@TheZig.io. We will do our best to clarify any issues you come across. Prerequisites: A Text Editor - We recommend Visual Studio Code for the ClientSide code, its lightweight, powerful and Free! (https://code.visualstudio.com/) SQL Server Management Studio (https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017) R - Software Environment for statitistal computing and graphics. You can download R at the mirrors listed here (https://cran.r-project.org/mirrors.html) Azure - Microsoft's Cloud Computing platform. You can create an account without a credit card by using the Azure Pass available at this link (https://azure.microsoft.com/en-us/offers/azure-pass/) Git - For source control and committing your final solution to a new private repo (https://git-scm.com/downloads) a. If you're not very familiar with git commands, here's a helpful cheatsheet (https://services.github.com/on-demand/downloads/github-git-cheat-sheet.pdf) 'R' Challenge For each numbered section below, write R code and comments to solve the problem or to show your rationale. For sections that ask you to give outputs, provide outputs in separate files and name them with the section number and the word output "Section 1 - Output". Create a private repo and submit your modified R script along with any supporting files. Load in the dataset from the accompanying file "account-defaults.csv" This dataset contains information about loan accounts that either went delinquent or stayed current on payments within the loan's first year. FirstYearDelinquency is the outcome variable, all others are predictors. The objective of modeling with this dataset is to be able to predict the probability that new accounts will become delinquent; it is primarily valuable to understand lower-risk accounts versus higher-risk accounts (and not just to predict 'yes' or 'no' for new accounts). FirstYearDelinquency - indicates whether the loan went delinquent within the first year of the loan's life (values of 1) AgeOldestIdentityRecord - number of months since the first record was reported by a national credit source AgeOldestAccount - number of months since the oldest account was opened AgeNewestAutoAccount - number of months since the most recent auto loan or lease account was opened TotalInquiries - total number of credit inquiries on record AvgAgeAutoAccounts - average number of months since auto loan or lease accounts were opened TotalAutoAccountsNeverDelinquent - total number of auto loan or lease accounts that were never delinquent WorstDelinquency - worst status of days-delinquent on an account in the first 12 months of an account's life; values of '400' indicate '400 or greater' HasInquiryTelecomm - indicates whether one or more telecommunications credit inquires are on record within the last 12 months (values of 1) Perform an exploratory data analysis on the accounts data In your analysis include summary statistics and visualizations of the distributions and relationships. Build one or more predictive model(s) on the accounts data using regression techniques Identify the strongest predictor variables and provide interpretations. Identify and explain issues with the model(s) such as collinearity, etc. Calculate predictions and show model performance on out-of-sample data. Summarize out-of-sample data in tiers from highest-risk to lowest-risk. Split up the dataset by the WorstDelinquency variable. For each subset, run a simple regression of FirstYearDelinquency ~ TotalInquiries. Extract the predictor's coefficient and p-value from each model. Store the in a list where the names of the list correspond to the values of WorstDelinquency. Load in the dataset from the accompanying file "vehicle-depreciation.csv". The dataset contains information about vehicles that our company purchases at auction, sells to customers, repossess from defaulted accounts, and finally re-sell at auction to recover some of our losses. Perform an analysis and/or build a predictive model that provides a method to estimate the depreciation of vehicle worth (from auction purchase to auction sale). Use whatever techniques you want to provide insight into the dataset and walk us through your results - this is your chance to show off your analytical and storytelling skills! CustomerGrade - the credit risk grade of the customer AuctionPurchaseDate - the date that the vehicle was purchased at auction AuctionPurchaseAmount - the dollar amount spent purchasing the vehicle at auction AuctionSaleDate - the date that the vehicle was sold at auction AuctionSaleAmount - the dollar amount received for selling the vehicle at auction VehicleType - the high-level class of the vehicle Year - the year of the vehicle Make - the make of the vehicle Model - the model of the vehicle Trim - the trim of the vehicle BodyType - the body style of the vehicle AuctionPurchaseOdometer - the odometer value of the vehicle at the time of purchase at the auction AutomaticTransmission - indicates (with value of 1) whether the vehicle has an automatic transmission DriveType - the drivetrain type of the vehicle
PDF Data Processing and Semantic Embedding - This repository offers a robust solution for extracting text from PDF files, generating semantic embeddings using OpenAI's API, and storing these embeddings in Pinecone for sophisticated vector-based search capabilities.
climatepolicyradar
A wrapper around the azure text extraction api for extracting text from pdfs and converting to our data model at CPR.