# python-pdf2txt

`python-pdf2txt` is a Dockerized Python application that converts PDF files into editable Word documents. It uses Flask to expose a web service that accepts PDF uploads over HTTP, runs OCR on them, and returns the converted DOCX files.

## Features

- **PDF to Word Conversion**: Converts PDF documents to DOCX format using OCR.
- **Dockerized Application**: Runs in a container, so deployment is the same across environments.
- **REST API**: A simple HTTP API that accepts a PDF upload and returns the converted DOCX file.
## Getting Started

### Step 1: Clone the Repository

Clone the repository to your local machine:

```bash
git clone https://github.com/your-username/python-pdf2txt.git
cd python-pdf2txt
```
### Step 2: Build and Run the Docker Container

Use Docker Compose to build and run the container:

```bash
docker-compose up --build -d
```

The `--build` flag builds the Docker image before starting the container, and `-d` runs the container in detached mode. The service is then available at `http://localhost:4000`.
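
Because the container starts in detached mode, the command returns before the service is necessarily ready. A minimal sketch of a readiness check, assuming the service listens on port 4000 as stated above; the helper name `wait_for_port` is hypothetical and not part of this repository. An open port only shows that something is listening, not that the application has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of python-pdf2txt.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (assumes the container from this step is running):
# ready = wait_for_port("localhost", 4000)
```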
### Step 3: Convert a PDF to Word

Upload a PDF with `curl` and save the converted Word document:

```bash
curl -X POST -F "file=@path_to_your_pdf_file.pdf" http://localhost:4000/upload-pdf --output converted.docx
```

Replace `path_to_your_pdf_file.pdf` with the path to the PDF you want to convert. The response is saved as `converted.docx`.
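
The same request can be made from Python. The sketch below uses only the standard library and mirrors the `curl` command: a `multipart/form-data` POST with a single field named `file`, sent to `/upload-pdf`. The function names `encode_multipart` and `convert_pdf` are hypothetical, and the sketch assumes the service returns the DOCX bytes directly in the response body, as the `--output` flag above suggests.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body with one file field.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path: str, out_path: str,
                url: str = "http://localhost:4000/upload-pdf") -> None:
    """Upload a PDF and write the response body to out_path (hypothetical helper)."""
    pdf = Path(pdf_path)
    body, content_type = encode_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())


# Example usage (assumes the service from Step 2 is running):
# convert_pdf("path_to_your_pdf_file.pdf", "converted.docx")
```

OCR on a large document can take a while, so a caller may want to pass a `timeout` to `urlopen` that suits the documents being converted.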
### Step 4: View Application Logs

To follow the application's activity in real time, tail the log files:

```bash
tail -f ./logs/*
```

This command follows every file in `./logs` and prints new entries as they are written.
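
For a one-off snapshot instead of a live stream, for example in a script that collects diagnostics, the last few lines of each log file can be read from Python. This is a minimal sketch, assuming the logs are plain-text files directly under `./logs`; the function name `last_lines` is hypothetical.

```python
from collections import deque
from pathlib import Path


def last_lines(log_dir: str = "./logs", count: int = 10) -> dict[str, list[str]]:
    """Return the last `count` lines of every regular file in log_dir.

    Hypothetical helper for illustration; it takes a snapshot and does not
    follow the files the way `tail -f` does.
    """
    result: dict[str, list[str]] = {}
    for path in sorted(Path(log_dir).glob("*")):
        if path.is_file():
            with path.open("r", encoding="utf-8", errors="replace") as handle:
                # deque with maxlen keeps only the final `count` lines in memory.
                tail = deque(handle, maxlen=count)
            result[path.name] = [line.rstrip("\n") for line in tail]
    return result


# Example usage:
# for name, lines in last_lines().items():
#     print(f"==> {name} <==")
#     print("\n".join(lines))
```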
## Additional Configuration

The application reads additional settings, such as the Tesseract data prefix, from environment variables. Adjust them in the `docker-compose.yml` file to suit your setup.
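
As an illustration of how such a setting is typically consumed, the sketch below reads `TESSDATA_PREFIX`, the environment variable Tesseract itself uses to locate its language data. Whether this application reads that exact variable is an assumption; check `docker-compose.yml` for the names actually in use. The fallback path and the function name are hypothetical.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback; the real location depends on the Tesseract install.
ASSUMED_DEFAULT_TESSDATA_DIR = "/usr/share/tessdata"


def resolve_tessdata_prefix(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Tesseract data directory from the environment, or the fallback.

    Illustrative only; not taken from the python-pdf2txt source.
    """
    env = os.environ if env is None else env
    value = env.get("TESSDATA_PREFIX", "").strip()
    # Treat an unset or empty variable the same way and use the fallback.
    return value or ASSUMED_DEFAULT_TESSDATA_DIR


# Example usage:
# print(resolve_tessdata_prefix())
```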