Extract data from tables in pdf documents

Example of pdf document which contains a table.

The web offers many interesting documents in Adobe PDF format, and quite often they contain useful tables of data. To visualize this data the way I like, I first need to extract it from the PDF document and store it in a format that can be processed, for example Excel or comma-separated values (CSV). Here is an example of how I did this.

The wrong way… at least for big tables

The document I used contains one big table spread over more than 100 pages. My first thought was to copy and paste the data from the table manually. But when I used select all, copy and paste to move the data into an OpenOffice spreadsheet (or Excel), I lost all the formatting of the table, so that was not an option. The same problem occurs with Adobe Reader and the menu option File – Save as Text. For a small table this method works fine, of course, because you can quickly reformat the data by hand.

How I succeeded for big tables

I managed to get all the data (about 2000 rows) into one spreadsheet in two steps. First I used an online PDF to XLS conversion tool. The result was an Excel document with 118 sheets, one for every page of the PDF document. I did not feel like spending my time on the boring job of repeating copy and paste 118 times, so I got an Excel Visual Basic macro from the web and used it to combine all the sheets into one sheet automatically.
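For readers who prefer a script to a macro, the combining step can be sketched in Python. This is not the macro I used, only a minimal illustration of the same idea. It assumes each sheet has already been read into a list of rows (for example from one CSV export per sheet) and that every page repeats the same header row. The function names and the sample data are hypothetical.

```python
import csv
import io


def combine_pages(pages, has_header=True):
    """Merge per-page tables (lists of rows) into one table.

    Assumption: when has_header is True, the first non-blank row of the
    first page is the header, and an identical first row on later pages
    is a repeat that can be dropped.
    """
    combined = []
    header = None
    for page in pages:
        # Skip rows in which every cell is empty.
        rows = [row for row in page if any(str(cell).strip() for cell in row)]
        if not rows:
            continue
        if has_header:
            if header is None:
                header = rows[0]
                combined.append(header)
                rows = rows[1:]
            elif rows[0] == header:
                rows = rows[1:]
        combined.extend(rows)
    return combined


def to_csv(rows):
    """Return the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for two converted pages.
page_1 = [["Company", "Country"], ["Acme", "NL"], ["Globex", "US"]]
page_2 = [["Company", "Country"], ["Initech", "DE"], ["", ""]]

combined = combine_pages([page_1, page_2])
print(to_csv(combined))
```

With 118 sheets the call stays the same: pass a list with one entry per sheet. If the pages do not repeat the header, set has_header=False so that no row is dropped.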

Minor problems

I encountered a minor problem in both steps. The PDF conversion service works with a huge delay: it took many hours before I received the email with the result attached. The Excel Visual Basic macro did not work correctly at first either. The problem seemed to be the active cell on each of the individual sheets. After I selected all sheets together and made A1 the active cell on all of them, the macro worked perfectly.

