python-pdf2txt
python-pdf2txt is a Dockerized Python application that converts PDF files into editable Word documents. It uses Flask to expose a web service that accepts PDF uploads over HTTP, runs OCR on them, and returns the converted DOCX files.
Features
- PDF to Word Conversion: Converts PDF documents to DOCX format using OCR.
- Dockerized Application: Facilitates easy deployment and consistent performance across various environments.
- REST API: Simple API for straightforward integration, supporting PDF uploads and DOCX retrievals.
Getting Started
Step 1: Clone the Repository
Clone the repository to your local machine to get started:
git clone https://github.com/your-username/python-pdf2txt.git
cd python-pdf2txt
Step 2: Build and Run the Docker Container
Use Docker Compose to build and run your container:
docker-compose up --build -d
This command builds the Docker image (the --build flag rebuilds it even if an image already exists) and starts the container in detached mode. The service will be available on localhost at port 4000.
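Because the container starts in detached mode, the command returns before the service is necessarily ready. The following is a minimal sketch of a readiness check, assuming only that the service listens on port 4000 as stated above; the function name wait_for_port and its parameters are hypothetical and not part of this project. A successful TCP connection shows only that the port is open, not that the application has finished starting.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 4000) after docker-compose up blocks until the port accepts connections or 30 seconds have passed.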
Step 3: Convert a PDF to Word
Convert a PDF to a Word document by executing the following curl command:
curl -X POST -F "file=@path_to_your_pdf_file.pdf" http://localhost:4000/upload-pdf --output converted.docx
Make sure to replace path_to_your_pdf_file.pdf with the actual path to the PDF you intend to convert. The output will be saved as converted.docx.
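The same upload can be made from Python. The sketch below uses only the standard library and mirrors the curl command: it sends the PDF as a multipart form field named file to /upload-pdf and writes the response body to disk. The helper names build_multipart and convert_pdf are hypothetical, and the sketch assumes the service returns the DOCX bytes directly in the response body, as the curl --output usage suggests.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path: str, out_path: str,
                url: str = "http://localhost:4000/upload-pdf") -> None:
    """POST the PDF to the service and write the response body to out_path."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError if the service answers with an error status.
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
```

With the container running, convert_pdf("path_to_your_pdf_file.pdf", "converted.docx") should produce the same file as the curl command.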
Step 4: View Application Logs
To follow the application's activity in real time, tail the log files:
tail -f ./logs/*
This command follows the files in ./logs and prints new entries as they are written.
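Where tail is not available, the same idea can be approximated in Python. This is a rough sketch, not part of the project: the hypothetical helper read_new_lines returns the complete lines appended to a file since a given byte offset, so a caller can poll each file under ./logs in a loop and print whatever is new.

```python
from pathlib import Path


def read_new_lines(path: str, offset: int = 0) -> tuple[list[str], int]:
    """Return complete lines appended to path since byte offset, plus the new offset."""
    file = Path(path)
    if file.stat().st_size < offset:
        # The file shrank (for example after log rotation): start over.
        offset = 0
    with file.open("rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Consume only up to the last newline so a half-written line is left for later.
    end = chunk.rfind(b"\n") + 1
    text = chunk[:end].decode("utf-8", errors="replace")
    return text.splitlines(), offset + end
```

A caller keeps one offset per file, calls read_new_lines at a fixed interval, and stores the returned offset for the next call.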
Additional Configuration
The application uses environment variables for additional configuration, such as the Tesseract data prefix, which can be adjusted in the docker-compose.yml file to suit your setup.
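As an illustration of how such a setting might be read inside the container, the sketch below looks up TESSDATA_PREFIX, the variable Tesseract itself consults for its data directory. Whether this project's docker-compose.yml uses that exact name is an assumption, and both the helper tessdata_prefix and the fallback path are hypothetical.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback; the real default depends on the image's Tesseract install.
DEFAULT_TESSDATA = "/usr/share/tessdata"


def tessdata_prefix(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Tesseract data directory from the environment, or a fallback."""
    source = os.environ if env is None else env
    # Assumption: the compose file sets TESSDATA_PREFIX; adjust the name if it differs.
    value = source.get("TESSDATA_PREFIX", "").strip()
    return value or DEFAULT_TESSDATA
```

Passing a mapping instead of relying on os.environ keeps the lookup easy to exercise without changing the process environment.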