Commit e3150e3

DRILL-8028: Add PDF Format Plugin (#2359)
* Initial commit
* WIP
* Regular queries working
* Metadata fields working
* Minor fixes
* Fixed unit test
* Added additional closing functions.
* WIP
* Fixed Headless Issue
* Updated to Drill 1.20
* Added option to merge pages
* Ready for PR
* Removed struts
* WIP
* Progress..
* UTs all passing
* Fix Duplicate Page Issue
* Fixed extract headers
* Refactored Tables and Added Metadata class
* Added UT
* Code cleanup
* New UTs
* Added UTs
* Added UT and removed extra test files
* Removed comment
* Removed comment
* Bump pdfbox to latest version
* Moved Java config to drill-config.sh
1 parent 5dea409 commit e3150e3

29 files changed, with 1847 additions and 0 deletions.

contrib/format-pdf/README.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
# Format Plugin for PDF Table Reader
One of the most annoying tasks in a data science project is dealing with data that arrives in a PDF file. This plugin endeavours to let you query data in PDF tables through Drill's SQL interface.

## Data Model
Since PDF files are generally not intended to be queried or read by machines, mapping the data to tables and rows is not a perfect process. The PDF reader supports provided schemas. You can read about Drill's [provided schema functionality here](https://drill.apache.org/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter).

### Merging Pages
The PDF reader reads tables from each page of a PDF file. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.

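As a rough sketch of the query-time form, the query below passes `combinePages` through `table()`. The file path is a hypothetical placeholder, not a file shipped with the plugin.

```sql
-- Hypothetical path: substitute a PDF whose table continues across pages.
SELECT *
FROM table(dfs.`pdf/multi_page_report.pdf` (type => 'pdf', combinePages => true))
```
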
## Configuration
To configure the PDF reader, add the information below to the `formats` section of a file-based storage plugin, such as `dfs`, `hdfs` or `s3`.

```json
"pdf": {
  "type": "pdf",
  "extensions": [
    "pdf"
  ],
  "extractionAlgorithm": "spreadsheet",
  "extractHeaders": true,
  "combinePages": false
}
```
The available options are:
* `extractHeaders`: Extracts the first row of any table as the header row. If set to `false`, Drill assigns the column names `field_0`, `field_1`, and so on.
* `combinePages`: Merges multi-page tables together.
* `defaultTableIndex`: Allows you to query different tables within the PDF file. Index starts at `1`.
* `extractionAlgorithm`: Allows you to choose the algorithm used to extract data from the PDF file. Choices are `spreadsheet` and `basic`. Depending on your data, one may work better than the other.

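Drill generally exposes format options as `table()` parameters, so these options can likely be overridden for a single query. The sketch below assumes the parameter names match the configuration keys listed above; the file path is a hypothetical placeholder.

```sql
-- Hypothetical path. Reads the second table in the file with the `basic`
-- algorithm and no header extraction, so columns are named field_0, field_1, ...
SELECT *
FROM table(dfs.`pdf/report.pdf` (type => 'pdf',
    defaultTableIndex => 2,
    extractionAlgorithm => 'basic',
    extractHeaders => false))
```
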
## Accessing Document Metadata Fields
PDF files carry a considerable amount of metadata that can be useful for analysis. Drill extracts the following fields from every PDF file. These fields are not projected in star queries and must be selected explicitly. The document's creator populates them, so some or all may be empty. Except for `_page_count`, which is an `INT`, and the two date fields, all fields are `VARCHAR`.

The fields are:
* `_page_count`
* `_author`
* `_title`
* `_subject`
* `_keywords`
* `_creator`
* `_producer`
* `_creation_date`
* `_modification_date`
* `_trapped`
* `_table_count`

The query below will access a document's metadata:

```sql
SELECT _page_count, _title, _author, _subject,
  _keywords, _creator, _producer, _creation_date,
  _modification_date, _trapped
FROM dfs.`pdf/20.pdf`
```
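The `_table_count` field can be selected the same way, which may help when deciding on a `defaultTableIndex`. This is a sketch that reuses the path from the query above. The `LIMIT 1` rests on the assumption that the metadata values repeat on every row.

```sql
-- Check how many tables and pages the reader found in the document.
SELECT _table_count, _page_count
FROM dfs.`pdf/20.pdf`
LIMIT 1
```
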
The query below demonstrates how to define a schema at query time:

```sql
SELECT * FROM table(cp.`pdf/schools.pdf` (type => 'pdf', combinePages => true,
  schema => 'inline=(`Last Name` VARCHAR, `First Name Address` VARCHAR,
  `field_0` VARCHAR, `City` VARCHAR, `State` VARCHAR, `Zip` VARCHAR,
  `field_1` VARCHAR, `Occupation Employer` VARCHAR,
  `Date` VARCHAR, `field_2` DATE properties {`drill.format` = `M/d/yyyy`},
  `Amount` DOUBLE)'))
LIMIT 5
```

### Encrypted Files
If a PDF file is encrypted, you can supply its password via the `table()` function as shown below. Note that the password will be recorded in any query logs that may exist.

```sql
SELECT *
FROM table(dfs.`encrypted_pdf.pdf` (type => 'pdf', password => 'your_password'))
```

contrib/format-pdf/pom.xml

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
<?xml version="1.0"?>
<!--

    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements. See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership. The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <artifactId>drill-contrib-parent</artifactId>
    <groupId>org.apache.drill.contrib</groupId>
    <version>1.20.0-SNAPSHOT</version>
  </parent>

  <artifactId>drill-format-pdf</artifactId>
  <name>Drill : Contrib : Format : PDF</name>

  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>technology.tabula</groupId>
      <artifactId>tabula</artifactId>
      <version>1.0.5</version>
      <exclusions>
        <exclusion>
          <artifactId>slf4j-simple</artifactId>
          <groupId>org.slf4j</groupId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>2.0.25</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Test dependencies -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <classifier>tests</classifier>
      <version>${project.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <classifier>tests</classifier>
      <version>${project.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-resources-plugin</artifactId>
        <executions>
          <execution>
            <id>copy-java-sources</id>
            <phase>process-sources</phase>
            <goals>
              <goal>copy-resources</goal>
            </goals>
            <configuration>
              <outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/pdf
              </outputDirectory>
              <resources>
                <resource>
                  <directory>src/main/java/org/apache/drill/exec/store/pdf</directory>
                  <filtering>true</filtering>
                </resource>
              </resources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
