Automated Data Mining/Extraction from Online PDFs

Cancelled Posted Nov 21, 2011 Paid on delivery

**Full description is attached.**

DO NOT APPLY FOR THIS JOB IF YOU HAVE NOT READ THE ENTIRE DESCRIPTION.

THIS IS AN AUTOMATED DATA MINING/EXTRACTION JOB -- NOT A MANUAL ONE.

We are looking for a contractor with solid experience developing software for data extraction from online PDFs. The PDFs are scanned copies of IRS forms filed by charities and are available through a single public online source. There are six different types of forms. Because of the IRS scanning process, the position of the data can vary from one scanned form to another, and scan quality varies as well.

We want someone with a demonstrated history of substantial, successful data mining using PDFs and OCR. If you are looking to learn or expand your profile, this is not for you. Fluent English is a must.

The contractor must develop a program that will do the following:

1. Download scanned PDFs of mixed quality from the online source using a list of URLs in a text file provided by the buyer (approximately 300,000 PDFs and URLs).

2. Extract up to ten numeric and text data fields from each PDF using a combination of automated graphical manipulation and OCR. The location of the data on the pages will be different for each of the six types of forms.

3. Incorporate error-checking based on related data fields selected by buyer.

4. Format the data output as a CSV to be uploaded to buyer's SQL database.

5. Provide well-commented source code and an executable. The program will be run on an ongoing basis by the buyer.

6. Deliver written step-by-step operating instructions that a novice user can readily understand and follow.

7. Pass the following accuracy tests when operated by the buyer: Based on 10,000 URLs chosen by the buyer, the program will (a) download 100% of the PDFs and (b) correctly extract at least 90% of the designated data fields from the downloaded PDFs, with the error-checking identifying every data field where extraction failed.
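To illustrate what requirements 3 and 4 might look like in practice, here is a minimal Python sketch. It assumes the OCR step has already produced raw text for each field. The field names (`total_revenue`, `total_expenses`, `net_income`) and the rule that revenue minus expenses should equal net income are hypothetical stand-ins; the actual related fields would be selected by the buyer.

```python
import csv
import io

# Hypothetical field names; the real ones would be selected by the buyer.
AMOUNT_FIELDS = ["total_revenue", "total_expenses", "net_income"]


def parse_amount(text):
    """Parse OCR text such as '1,234', '$1,234' or '(500)'; return None on failure."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # accounting-style negative: (500) means -500
    elif cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    if not (cleaned.isascii() and cleaned.isdigit()):
        return None  # OCR noise, e.g. the letter O read in place of a zero
    return -int(cleaned) if negative else int(cleaned)


def flag_errors(fields):
    """Return the sorted names of fields that failed to parse or failed the cross-check."""
    flagged = set()
    values = {}
    for name in AMOUNT_FIELDS:
        values[name] = parse_amount(fields.get(name, ""))
        if values[name] is None:
            flagged.add(name)
    # Assumed cross-check: revenue - expenses should equal net income.
    # It only runs when all three values parsed; a mismatch cannot tell us
    # which field is wrong, so all three are flagged for review.
    if not flagged:
        if values["total_revenue"] - values["total_expenses"] != values["net_income"]:
            flagged.update(AMOUNT_FIELDS)
    return sorted(flagged)


def write_csv(records, out):
    """Write one CSV row per (url, fields) pair, with a column listing flagged fields."""
    writer = csv.writer(out)
    writer.writerow(["url"] + AMOUNT_FIELDS + ["flagged_fields"])
    for url, fields in records:
        row = [url] + [fields.get(name, "") for name in AMOUNT_FIELDS]
        row.append(";".join(flag_errors(fields)))
        writer.writerow(row)


if __name__ == "__main__":
    sample = [
        ("http://example.org/a.pdf",
         {"total_revenue": "1,000", "total_expenses": "400", "net_income": "600"}),
        ("http://example.org/b.pdf",
         {"total_revenue": "1,0O0", "total_expenses": "400", "net_income": "600"}),
    ]
    buffer = io.StringIO()
    write_csv(sample, buffer)
    print(buffer.getvalue())
```

Flagging every field involved in a failed cross-check is a conservative choice: it produces some false alarms, but it leans toward the requirement that error-checking identify all fields where extraction failed. A cross-check of this kind cannot catch errors that happen to cancel out, so it would likely need to be combined with other checks.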


Skills: Engineering, Microsoft, Project Management, Script Install, Shell Script, Software Architecture, Software Testing, Windows Desktop

Project ID: #3710190

About the project

1 proposal Remote project Active Dec 5, 2011

1 freelancer is bidding on average $5001 for this job

matfizvw

See private message.

$5000.55 USD in 21 days
(54 Reviews)
6.0