ViTLP

A visually guided generative text-layout pre-training model for document intelligence

  • Native OCR text localization and recognition: ViTLP locates and recognizes text directly on document images.
  • Pre-trained ViTLP-medium model: A pre-trained checkpoint with 380M parameters that delivers good performance with limited computing resources.
  • Fast inference: On an Nvidia 4090, ViTLP processes one page of a document image in 5 to 10 seconds.
  • Hugging Face support: The pre-trained ViTLP weights are published on Hugging Face, so users can download and use them directly.
  • Easy to integrate and use: With the provided code and instructions, users can integrate ViTLP into their own projects.
  • Batch decoding: Users can decode document images in batches with the provided decode.sh script.
  • Suited to intelligent document processing: ViTLP fits scenarios that require text detection and recognition on document images, such as automated document processing and archive digitization.

Product Details

ViTLP is a visually guided generative text-layout pre-training model that aims to improve the efficiency and accuracy of intelligent document processing. It performs OCR text localization and recognition natively, so it can detect and recognize text on document images in a single model. The released checkpoint, ViTLP-medium (380M parameters), is a balanced choice given constraints on computing resources and pre-training dataset size: it preserves model performance while keeping inference speed and memory usage manageable. On an Nvidia 4090, ViTLP typically takes 5 to 10 seconds to process one page of a document image, which is competitive with most OCR engines.
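
To put the 5 to 10 seconds per page figure in context, the following is a minimal sketch that converts a per-page latency into pages per hour and total batch time. It assumes a single GPU processing pages one after another at a constant latency; the function names `pages_per_hour` and `batch_hours` are illustrative and are not part of the ViTLP code base.

```python
# Back-of-the-envelope throughput estimate from a per-page latency.
# Assumption: one GPU, pages processed sequentially at a constant latency.
# These helpers are hypothetical and not part of ViTLP.


def pages_per_hour(seconds_per_page: float) -> float:
    """Return how many pages can be processed in one hour."""
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return 3600.0 / seconds_per_page


def batch_hours(num_pages: int, seconds_per_page: float) -> float:
    """Return the hours needed to process num_pages sequentially."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return num_pages * seconds_per_page / 3600.0


if __name__ == "__main__":
    # Latency bounds quoted in the text for an Nvidia 4090.
    for latency in (5.0, 10.0):
        print(
            f"{latency:>4.1f} s/page: "
            f"{pages_per_hour(latency):.0f} pages/hour, "
            f"{batch_hours(1000, latency):.2f} h per 1000 pages"
        )
```

Under these assumptions, the quoted range corresponds to roughly 360 to 720 pages per hour on one GPU. Actual throughput depends on page content and batching, so the numbers are only a planning estimate.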