Merged
64 commits
6c6f7b4
Add primary pipeline for labeling subfigures.
Adibvafa Sep 12, 2024
a8b7354
Add init files.
Adibvafa Sep 12, 2024
f5cd900
Add prompts directory.
Adibvafa Sep 12, 2024
f27b0e0
Add primary subcaption pipeline.
Adibvafa Sep 12, 2024
ece305a
Add subcaption pipeline.
Adibvafa Sep 13, 2024
9cf01ac
Add subfigure pipeline.
Adibvafa Sep 13, 2024
e8bc0a2
Add helper modules for loading trained models.
Adibvafa Sep 13, 2024
3ee7525
Fix import path.
Adibvafa Sep 19, 2024
b8ee4ec
Add working revised pipeline for subfigure detection.
Adibvafa Sep 19, 2024
b7e12c8
Finalize subcaption pipeline.
Adibvafa Sep 26, 2024
5e1f371
Finalize subfigure separation pipeline.
Adibvafa Sep 26, 2024
80b77b2
Add primary ocr pipeline.
Adibvafa Sep 26, 2024
408ec79
Add playground code to use ocr pipeline.
Adibvafa Sep 26, 2024
6715d5d
Add classification to subfigure pipeline.
Adibvafa Oct 7, 2024
cbc86e4
Update readme.
Adibvafa Oct 7, 2024
581f8e3
Revise prior code detecting subfigures.
Adibvafa Oct 7, 2024
1f4b0f6
Move files to granular directory.
Adibvafa Oct 7, 2024
94893c1
Fix merge conflicts.
Adibvafa Oct 7, 2024
60630bb
Add preprocessing pipeline to prepare dataset and filter medical images.
Adibvafa Oct 10, 2024
cbc76f1
Enhance pipeline style.
Adibvafa Oct 12, 2024
4a8adc9
Add modality classification script.
Adibvafa Oct 12, 2024
6ae2b62
Complete the set of classes.
Adibvafa Oct 12, 2024
0a9fcbb
Update the set of keywords used for initial filtering of medical images.
Adibvafa Oct 12, 2024
57344ed
Increase preprocessing pipeline efficiency.
Adibvafa Oct 13, 2024
9da8930
Use ProcessPoolExecutor for preprocessing dataset.
Adibvafa Oct 13, 2024
2a70018
Add pipeline to classify subfigures.
Adibvafa Oct 13, 2024
a86a13e
Refactor classification pipeline from separation pipeline.
Adibvafa Oct 13, 2024
5e7857e
Remove dead code.
Adibvafa Jan 6, 2025
e1067ca
Add subfigure detector refactored model.
Adibvafa Jan 6, 2025
e5a308e
Add transformer modules.
Adibvafa Jan 6, 2025
7b81d9a
Remove prompts directory.
Adibvafa Jan 6, 2025
2e6d0ff
Finalize subfigure pipeline.
Adibvafa Jan 6, 2025
c8bba1d
pipeline/subcaption.py
Adibvafa Jan 6, 2025
2fee5e9
Add subcaption pipeline.
Adibvafa Jan 6, 2025
995c83f
Add subfigure classification pipeline.
Adibvafa Jan 6, 2025
ac5bd0a
Finalize preprocessing pipeline.
Adibvafa Jan 6, 2025
2c0e583
Add alignment pipeline.
Adibvafa Jan 6, 2025
0fb6462
Add config for yolov model.
Adibvafa Jan 6, 2025
9bd713a
Add path to checkpoints in classifier and subfigure model.
Adibvafa Jan 6, 2025
980b677
Add subfigure detection and ocr models.
Adibvafa Jan 6, 2025
3906b31
Add checkpoints directory.
Adibvafa Jan 6, 2025
1791643
Merge branch 'main' of github.com:VectorInstitute/pmc-data-extraction…
Adibvafa Jan 6, 2025
88b7ab6
Add missing init file to config directory.
Adibvafa Jan 6, 2025
78ff49b
Remove biomedclip classification logic.
Adibvafa Jan 9, 2025
b71e933
Remove original loop from align.sh
Adibvafa Jan 9, 2025
d75690f
Remove dead code in model files.
Adibvafa Jan 9, 2025
5e6c48f
Remove deadcode from pipeline.
Adibvafa Jan 9, 2025
a43c3bf
Remove histogram notebook.
Adibvafa Jan 9, 2025
045233b
Update align.sh: Replace hardcoded directories with environment varia…
Adibvafa Jan 9, 2025
a1ab5d6
Update classify.sh: Replace hardcoded directories with environment va…
Adibvafa Jan 9, 2025
cd3ed1f
Update preprocess.sh: Replace hardcoded directories with environment …
Adibvafa Jan 9, 2025
3a00710
Update subcaption.sh: Replace hardcoded directories with environment …
Adibvafa Jan 9, 2025
4937b5e
Update subfigure.sh: Replace hardcoded directories with environment v…
Adibvafa Jan 9, 2025
6990858
Remove old util files.
Adibvafa Jan 14, 2025
ec99fe7
Add readme on how the granular pipeline should be used.
Adibvafa Jan 14, 2025
2da05e1
Improve the file naming convention in sh files.
Adibvafa Jan 14, 2025
1c365bd
Improve style.
Adibvafa Jan 14, 2025
58a9148
Add subcaption pipeline.
Adibvafa Jan 14, 2025
7600a16
Add guide to use the subcaption.ipynb for subcaption pipeline.
Adibvafa Jan 14, 2025
af738ff
Remove files with explicit directories.
Adibvafa Jan 14, 2025
616b6ef
Add sample command for .sh files.
Adibvafa Jan 14, 2025
abff95c
Add file description to sh files.
Adibvafa Jan 14, 2025
70a9eb1
Improve style.
Adibvafa Jan 14, 2025
a671e51
Fix style issues.
Adibvafa Jan 18, 2025
110 changes: 107 additions & 3 deletions openpmcvl/granular/README.md
@@ -1,4 +1,108 @@
# Granular Package
# **Granular Pipeline**
Our goal is to create a fine-grained dataset of biomedical subfigure-subcaption pairs from a raw dataset of PMC figure-caption pairs. We assume that a dataset of PMC figure-caption pairs, e.g. PMC-17M, has already been downloaded and is formatted as a directory of JSONL files and a directory of .jpg image files. Note that all .sh files require you to pass the JSONL numbers from the PMC dataset as arguments.

This package contains tools to extract sub-figures and sub-captions from downloaded image-caption pairs.
This enlarges the dataset, and may increase the quality of the data as well since the sub-pairs will be more focused and less confusing.
Sample command:
```bash
sbatch openpmcvl/granular/pipeline/preprocess.sh 0 1 2 3 4 5 6 7 8 9 10 11
```


## **1. Preprocess**
> **Code:** `preprocess.py & preprocess.sh` <br>
> **Input:** Directory of figures and PMC metadata in JSONL format <br>
> **Output:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br>

- Filter out figure-caption pairs whose images are not .jpg files, are missing, or are corrupted.
- Filter for figure-caption pairs that contain target biomedical keywords.

Each datapoint contains the following fields:
- `id`: A unique identifier for the figure-caption pair.
- `PMC_ID`: The PMC ID of the article.
- `caption`: The caption of the figure.
- `image_path`: The path to the image file.
- `width`: The width of the image in pixels.
- `height`: The height of the image in pixels.
- `media_id`: The ID of the media file.
- `media_url`: The URL of the media file.
- `media_name`: The name of the media file.
- `keywords`: The keywords found in the caption.
- `is_medical`: Whether the caption contains any target biomedical keywords.
<br><br>

This script saves the output both as a directory of processed JSONL files and a merged JSONL file. The former is used in the next step of the pipeline.
<br><br>
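
The keyword filter described above can be sketched as follows. This is a minimal illustration, not the code in `preprocess.py`: the keyword list, the whole-word matching rule, and the function names `find_keywords` and `annotate` are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; the real list used by the pipeline is not shown here.
TARGET_KEYWORDS = ["mri", "ct", "x-ray", "ultrasound", "histology"]


def find_keywords(caption, keywords=TARGET_KEYWORDS):
    """Return the target keywords that occur as whole words in the caption."""
    text = caption.lower()
    found = []
    for kw in keywords:
        # Require non-alphanumeric neighbours so "ct" does not match inside "section".
        pattern = r"(?<![a-z0-9])" + re.escape(kw) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            found.append(kw)
    return found


def annotate(record):
    """Return a copy of the record with `keywords` and `is_medical` filled in."""
    found = find_keywords(record["caption"])
    return {**record, "keywords": found, "is_medical": bool(found)}
```

Whole-word matching is one possible design choice; plain substring matching would be simpler but would flag captions such as "cross section" for the keyword "ct".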


## **2. Subfigure Extraction**
> **Code:** `subfigure.py & subfigure.sh` <br>
> **Input:** Filtered figure-caption pairs in JSONL format (`${num}_meta.jsonl`) <br>
> **Output:** Directory of subfigure jpg files, and subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br>

- Break down compound figures into subfigures.
- Keep original figure for non-compound figures or if an exception occurs.

Each datapoint contains the following fields:

When a subfigure is successfully detected and separated:
- `id`: Unique identifier for the subfigure (format: {source_figure_id}_{subfigure_number}.jpg)
- `source_fig_id`: ID of the original compound figure
- `PMC_ID`: PMC ID of the source article
- `media_name`: Original filename of the compound figure
- `position`: Coordinates of subfigure bounding box [(x1,y1), (x2,y2)]
- `score`: Detection confidence score
- `subfig_path`: Path to saved subfigure image

When subfigure extraction fails:
- `id`: Generated ID that would have been used
- `source_fig_id`: ID of the original figure
- `PMC_ID`: PMC ID of the source article
- `media_name`: Original filename

This script saves extracted subfigures as .jpg files in the target directory. Metadata for each subfigure is stored in separate JSONL files, with unique IDs that link back to the original figure-caption pairs in the source JSONL files.
<br><br>
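
To make the two record shapes concrete, here is a small sketch that builds the metadata for one figure from a list of detections. It is an illustration under assumptions, not the logic in `subfigure.py`: the function name `subfigure_records`, the `(x1, y1, x2, y2, score)` detection tuple, zero-based subfigure numbering, and the ID used in the fallback case are all hypothetical.

```python
import os


def subfigure_records(figure, detections, out_dir):
    """Build subfigure metadata records for a single source figure.

    figure:     dict with `id`, `PMC_ID`, and `media_name`
    detections: list of (x1, y1, x2, y2, score) tuples (assumed layout);
                an empty list stands for a failed or skipped extraction
    """
    base = {
        "source_fig_id": figure["id"],
        "PMC_ID": figure["PMC_ID"],
        "media_name": figure["media_name"],
    }
    if not detections:
        # Fallback record: only the four fields listed for the failure case.
        return [{"id": f"{figure['id']}_0.jpg", **base}]

    records = []
    for n, (x1, y1, x2, y2, score) in enumerate(detections):
        sub_id = f"{figure['id']}_{n}.jpg"  # {source_figure_id}_{subfigure_number}.jpg
        records.append({
            "id": sub_id,
            **base,
            "position": [(x1, y1), (x2, y2)],
            "score": score,
            "subfig_path": os.path.join(out_dir, sub_id),
        })
    return records
```

Because every record carries `source_fig_id`, a subfigure can be traced back to its figure-caption pair regardless of which branch produced it.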


## **3. Subcaption Extraction**
> **Code:** `subcaption.ipynb | subcaption.py & subcaption.sh` <br>
> **Input:** PMC metadata in JSONL format <br>
> **Output:** PMC metadata in JSONL format with subcaptions <br>

- Extract subcaptions from captions.
- Keep original caption if the caption cannot be split into subcaptions.

While this script works, it's slow because it makes API calls one at a time. The notebook (`subcaption.ipynb`) uses batch API calls to speed this up, and it is highly recommended over the script.
<br><br>


## **4. Classification**
> **Code:** `classify.py & classify.sh` <br>
> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures.jsonl`) <br>
> **Output:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br>

- Classify subfigures and include metadata about their class.

The following fields are added to each datapoint:
- `is_medical_subfigure`: Whether the subfigure is a medical subfigure.
- `medical_class_rank`: The model's confidence in the medical classification.

This script preserves all subfigures: nothing is dropped at this stage. The `is_medical_subfigure` boolean flag and the `medical_class_rank` field are only added so that later steps can decide which subfigures to keep.
<br><br>
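
Since the classification step keeps every subfigure, any consumer of `${num}_subfigures_classified.jsonl` has to apply the flag itself. A minimal sketch of such a filter, assuming one JSON object per line and a hypothetical helper name `medical_subfigures`:

```python
import json


def medical_subfigures(lines):
    """Yield records whose `is_medical_subfigure` flag is true.

    `lines` is any iterable of JSONL lines, e.g. an open file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("is_medical_subfigure"):
            yield record
```

Passing an open file handle streams the records, so the whole JSONL file never has to be held in memory.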


## **5. Alignment**
> **Code:** `align.py & align.sh` <br>
> **Input:** Subfigure metadata in JSONL format (`${num}_subfigures_classified.jsonl`) <br>
> **Output:** Aligned subfigure metadata in JSONL format (`${num}_aligned.jsonl`) <br>

- Find the label associated with each subfigure.
- If no label is found, it means either:
  - The image is a standalone figure (not part of a compound figure)
  - The OCR model failed to detect the subfigure label (e.g. "A", "B", etc.)

Non-biomedical subfigures are removed at this step. The following fields are added to each remaining datapoint:
- `label`: The label associated with the subfigure. (e.g. "Subfigure-A")
- `label_position`: The position of the label in the subfigure.


The outputs from steps 3 and 5 contain labeled subcaptions and labeled subfigures, respectively. By matching these labels (e.g. "Subfigure-A"), we can create the final subfigure-subcaption pairs. Cases where labels are missing or captions could not be split are handled in subsequent steps. Refer to the notebook for more details.
<br><br>
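
The label matching described above amounts to a join on the label string. The sketch below shows the idea for a single source figure. The input shapes are assumptions (a list of subfigure dicts with `id` and `label`, and a dict mapping labels to subcaption text), and `pair_by_label` is a hypothetical name, not a function from this package.

```python
def pair_by_label(subfigures, subcaptions):
    """Pair subfigures with subcaptions that share the same label.

    subfigures:  list of dicts with `id` and (possibly missing) `label`
    subcaptions: dict mapping a label such as "Subfigure-A" to its text
    Returns (pairs, unmatched_ids).
    """
    pairs, unmatched = [], []
    for sub in subfigures:
        label = sub.get("label")
        text = subcaptions.get(label) if label else None
        if text is None:
            # No label detected, or no subcaption carries this label.
            unmatched.append(sub["id"])
        else:
            pairs.append({
                "subfigure_id": sub["id"],
                "label": label,
                "subcaption": text,
            })
    return pairs, unmatched
```

Returning the unmatched IDs separately keeps the missing-label cases available for whatever handling the later steps apply.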
Empty file added openpmcvl/granular/__init__.py
Empty file.
Empty file.
Empty file.
34 changes: 34 additions & 0 deletions openpmcvl/granular/config/yolov3_default_subfig.cfg
@@ -0,0 +1,34 @@
MODEL:
  TYPE: YOLOv3
  BACKBONE: darknet53
  ANCHORS: [[6, 7], [9, 10], [10, 14],
            [13, 11], [16, 15], [15, 20],
            [21, 19], [24, 24], [34, 31]]
  ANCH_MASK: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
  N_CLASSES: 15
TRAIN:
  LR: 0.001
  MOMENTUM: 0.9
  DECAY: 0.0005
  BURN_IN: 1000
  MAXITER: 20000
  STEPS: (400000, 450000)
  BATCHSIZE: 4
  SUBDIVISION: 16
  IMGSIZE: 608
  LOSSTYPE: l2
  IGNORETHRE: 0.7
AUGMENTATION:
  RANDRESIZE: True
  JITTER: 0.3
  RANDOM_PLACING: True
  HUE: 0.1
  SATURATION: 1.5
  EXPOSURE: 1.5
  LRFLIP: False
  RANDOM_DISTORT: True
TEST:
  CONFTHRE: 0.8
  NMSTHRE: 0.1
  IMGSIZE: 416
NUM_GPUS: 1
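
In YOLOv3-style configs, `ANCH_MASK` is usually read as a list of index groups into `ANCHORS`, one group per detection scale. Assuming that reading also applies to this file, the sketch below shows which anchor boxes each group selects. The helper name `anchors_per_head` is hypothetical; the values are copied from the config above.

```python
ANCHORS = [[6, 7], [9, 10], [10, 14],
           [13, 11], [16, 15], [15, 20],
           [21, 19], [24, 24], [34, 31]]
ANCH_MASK = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]


def anchors_per_head(anchors, masks):
    """Resolve each mask (a list of indices) to the anchor boxes it selects."""
    return [[anchors[i] for i in mask] for mask in masks]


for mask, group in zip(ANCH_MASK, anchors_per_head(ANCHORS, ANCH_MASK)):
    print(mask, "->", group)
```

Under this reading, the first mask picks the three largest anchors and the last mask picks the three smallest.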
Empty file.