olmocr

A toolkit for training language models to work with PDF documents in the wild.

Try the online demo: https://olmocr.allenai.org/

Requirements:

  • Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100)
  • 30GB of free disk space

You will also need to install poppler-utils and additional fonts for rendering PDF images.

Install dependencies (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools

Set up a conda environment and install olmocr

conda create -n olmocr python=3.11
conda activate olmocr

git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
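
To confirm the editable install is visible to Python, here is a quick sanity check (hypothetical, not part of the official instructions; importlib.metadata resolves the version of any pip-installed distribution, and the distribution name "olmocr" is assumed from the repository):

# Hypothetical sanity check, not from the olmocr docs: confirm the editable
# install registered a distribution named "olmocr" and print its version.
import importlib.metadata

print(importlib.metadata.version("olmocr"))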

If you want to run inference on a GPU, install sglang with flashinfer.

pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps
pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/

Local usage example

For a quick test, try the web demo. To run locally a GPU is required, since inference is powered by sglang. Convert a single PDF:

python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf

Convert multiple PDFs:

python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf

Results will be stored as JSON in ./localworkspace.

Viewing results

The extracted text is stored as Dolma-style JSONL inside the ./localworkspace/results directory.

cat localworkspace/results/output_*.jsonl
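
Each line is a standalone JSON record; here is a minimal sketch for pulling out the extracted text (this assumes the standard Dolma schema, where each record carries a text field alongside its metadata):

# Read the Dolma-style JSONL results and print the extracted text.
# Assumes each record has a "text" field, per the Dolma format.
import glob
import json

for path in glob.glob("localworkspace/results/output_*.jsonl"):
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            print(doc["text"])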

To view the results side-by-side with their original PDFs, use the dolmaviewer command:

python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl

Now open ./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html in your favorite browser.

Multi-node / cluster usage

If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports reading your PDFs from AWS S3 and coordinating the work via an AWS S3 output bucket.

For example, you can start this command on your first worker node, and it will set up a simple work queue in your AWS bucket and start converting PDFs.

python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf

Now on any subsequent nodes, simply run the command below and they will start grabbing items from the same workspace queue.

python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
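
Once the workers drain the queue, results accumulate under the workspace prefix on S3. A minimal sketch for listing them with boto3 (the bucket and prefix are the placeholder names from the examples above, and the results/ layout is assumed to mirror the local ./localworkspace/results directory):

# List finished result files under the S3 workspace (placeholder names).
# Assumes the results/ prefix mirrors the local workspace layout.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="my_s3_bucket",
    Prefix="pdfworkspaces/exampleworkspace/results/",
)
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])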

If you are at Ai2 and want to linearize millions of PDFs efficiently using beaker, simply add the --beaker flag. This will prepare the workspace on your local machine, then launch N GPU workers in the cluster to start converting PDFs.

For example:

python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4

Full documentation for the pipeline

python -m olmocr.pipeline --help
usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
                   [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
                   [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
                   [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/

options:
  -h, --help            show this help message and exit
  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
  --workspace_profile WORKSPACE_PROFILE
                        S3 configuration profile for accessing the workspace
  --pdf_profile PDF_PROFILE
                        S3 configuration profile for accessing the raw pdf documents
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
  --stats               Instead of running any job, reports some statistics about the current workspace
  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
                        one which is fastest to access
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to sglang server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
                        Maximum amount of anchor text to use (characters)
  --beaker              Submit this job to beaker instead of running locally
  --beaker_workspace BEAKER_WORKSPACE
                        Beaker workspace to submit to
  --beaker_cluster BEAKER_CLUSTER
                        Beaker clusters you want to run on
  --beaker_gpus BEAKER_GPUS
                        Number of gpu replicas to run
  --beaker_priority BEAKER_PRIORITY
                        Beaker priority level for the job
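
Since the pipeline is a plain CLI, it can also be driven programmatically; a minimal sketch combining a few of the documented flags (the paths and flag values are illustrative only, not recommendations):

# Drive the pipeline from Python via its CLI; values are illustrative.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "olmocr.pipeline",
        "./localworkspace",                           # workspace (positional)
        "--pdfs", "tests/gnarly_pdfs/horribleocr.pdf",  # input PDF
        "--workers", "4",                             # documented: workers at a time
        "--max_page_retries", "3",                    # documented: per-page retry cap
    ],
    check=True,
)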

Team

olmOCR is developed and maintained by the AllenNLP team, backed by the Allen Institute for AI (AI2). AI2 is a non-profit institute whose mission is to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.

License

olmOCR is licensed under Apache 2.0. A full copy of the license can be found on GitHub.
