It looks like the election data is written in an Indic script, using a font such as Mangal. I'm not sure I could extract thousands of these PDFs perfectly and accurately unless my PC's regional settings were set to India :) The PDFs contain a lot of FlateDecode streams, and these have to be handled carefully to recover the original entities and attributes. You would be better off using a scripting language such as Python or Perl. It is simpler than C, C++ or C# for this job, because the crawl needs several steps: indexing first, then normalizing, then checking for inconsistencies and potential errors on some pages, and so on. Hopefully you will solve your problem soon.
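To illustrate the FlateDecode part, here is a minimal Python sketch, not a full PDF parser. It assumes the streams are plain zlib data with no additional filters or predictors, and the function name inflate_streams is just something I made up for the example. It scans the raw file bytes for stream ... endstream blocks and tries to inflate each one, skipping anything that is not valid zlib data.

```python
import re
import zlib

# Raw body between "stream" and "endstream"; the lookbehind avoids
# matching the tail of the word "endstream" itself.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bytes of every stream that is valid zlib data.

    Hypothetical helper: assumes plain FlateDecode with no other filters.
    """
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that follow
            # the compressed data before "endstream".
            data = zlib.decompressobj().decompress(body)
        except zlib.error:
            # Not zlib data (image, other filter, damaged stream): skip it.
            continue
        if data:
            results.append(data)
    return results


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for index, chunk in enumerate(inflate_streams(handle.read())):
            print(index, len(chunk), "bytes")
```

Run it with the path of one PDF as the only argument to see how many streams inflate cleanly. Keep in mind that the inflated page content usually holds font-specific character codes rather than Unicode text, so for Indic fonts you will probably still need the font's mapping tables, or a PDF library, to get readable names back.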
Thanks & best regards