# python-pdf2txt

`python-pdf2txt` is a Dockerized Python application that converts PDF files into editable Word documents. It uses Flask to expose a web service that accepts PDF uploads over HTTP, runs OCR on them, and returns the converted DOCX files.

## Features

- **PDF to Word Conversion**: Converts PDF documents to DOCX format using OCR.
- **Dockerized Application**: Simplifies deployment and keeps behavior consistent across environments.
- **REST API**: A simple HTTP API for uploading PDFs and retrieving the converted DOCX files.

## Getting Started

### Step 1: Clone the Repository

Clone the repository to your local machine:

```bash
git clone https://github.com/your-username/python-pdf2txt.git
cd python-pdf2txt
```

### Step 2: Build and Run the Docker Container

Use Docker Compose to build and run the container:

```bash
docker-compose up --build -d
```

This command builds the Docker image (rebuilding it if one already exists) and starts the container in detached mode. The service is then available at `http://localhost:4000`.

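
Because the container starts in detached mode, a script that uploads right away may run before the service is reachable. The following is a minimal sketch of a wait loop using only the standard library; the function name `wait_for_port` and its default timings are illustrative assumptions, not part of this project.

```python
import socket
import time


def wait_for_port(host="localhost", port=4000, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` before the first upload is usually enough for local scripting. Note that an open port only shows that something accepts connections on port 4000; it does not prove that the Flask application has finished starting.
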
### Step 3: Convert a PDF to Word

Convert a PDF to a Word document with the following curl command:

```bash
curl -X POST -F "file=@path_to_your_pdf_file.pdf" http://localhost:4000/upload-pdf --output converted.docx
```

Replace `path_to_your_pdf_file.pdf` with the actual path to the PDF you want to convert. The output is saved as `converted.docx`.

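
The same upload can be made from Python. The sketch below uses only the standard library and mirrors the curl command: it sends the PDF as the multipart form field `file` and writes the response body to disk. The helper names `build_multipart` and `convert_pdf` are hypothetical, and the code assumes the endpoint answers a successful upload with the DOCX bytes, as the curl example implies.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, files):
    """Encode form fields and files as a multipart/form-data body.

    fields: dict mapping field name -> str value
    files:  dict mapping field name -> (filename, bytes content)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    for name, (filename, content) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: {ctype}\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    # The closing delimiter marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_path, out_path, url="http://localhost:4000/upload-pdf"):
    """Upload a PDF to the service and save the returned DOCX to out_path."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart({}, {"file": (pdf.name, pdf.read_bytes())})
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
    return out_path
```

For example, `convert_pdf("path_to_your_pdf_file.pdf", "converted.docx")` would be the counterpart of the curl call above. Extra form fields can be passed through the `fields` argument of `build_multipart` if the endpoint accepts them.
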
### Step 4: View Application Logs

To follow the application's activity in real time, tail the logs:

```bash
tail -f ./logs/*
```

This command follows every file in the `./logs` folder and prints new log entries as they are written.

## New Features

### Callback URL Support

The application can now process PDF to Word conversions asynchronously. When the conversion finishes, the converted `.docx` file is sent to a callback URL that you specify. Processing happens in the background, so clients do not have to wait for a synchronous response.

## Updated Usage Instructions

### Asynchronous Conversion with Callback URL

1. **Trigger the Conversion**:
   To request a conversion and have the application send the resulting `.docx` file to a callback URL once processing is done, use the following curl command:

   ```bash
   curl -X POST -F "file=@path_to_your_pdf_file.pdf" \
     -F "callback_url=http://<your-callback-url>/callback" \
     http://localhost:4000/upload-pdf
   ```

   Replace `path_to_your_pdf_file.pdf` with the path to your PDF file, and `<your-callback-url>` with the address of your callback service.

2. **Callback Server Setup**:
   Prepare your callback server to handle incoming POST requests at the `/callback` endpoint. The server should process the incoming `.docx` file according to your application's logic.

### Example Callback Endpoint

Here is an example Flask route that could serve as your callback endpoint:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/callback', methods=['POST'])
def callback():
    # The converted document arrives as the multipart form field 'file'
    file = request.files['file']
    # Implement file handling logic here
    return jsonify({'message': 'File received successfully'}), 200
```

This endpoint is invoked with the converted `.docx` file once the PDF conversion is complete.

### Test Command for Callback Feature

To test the callback feature, send a PDF file to the `/upload-pdf` endpoint together with a `callback_url`. After the PDF is processed, the application sends the converted `.docx` file to that URL:

```bash
curl -X POST -F "file=@uploads/sample_input.pdf" \
  -F "callback_url=http://localhost:5000/callback" \
  http://localhost:4000/upload-pdf
```

Replace `http://localhost:5000/callback` with your actual callback endpoint, and make sure it is running and ready to accept the file.

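
For a quick local test without a Flask project, a throwaway receiver can be built from the standard library. The sketch below assumes the application posts the document as a multipart form field named `file`, which is what the Flask example reads; the names `extract_upload`, `CallbackHandler`, `serve`, and the `callback_received` folder are hypothetical. It is intended for local testing only, not as a production receiver.

```python
from email import policy
from email.parser import BytesParser
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

SAVE_DIR = Path("callback_received")  # hypothetical folder for received files


def extract_upload(content_type, body, field="file"):
    """Return (filename, data) for the named multipart field, or None."""
    if not content_type.lower().startswith("multipart/"):
        return None
    # Prepend the Content-Type header so the email parser can split the parts.
    message = BytesParser(policy=policy.HTTP).parsebytes(
        b"Content-Type: " + content_type.encode("latin-1") + b"\r\n\r\n" + body
    )
    for part in message.iter_parts():
        if part.get_param("name", header="content-disposition") == field:
            return part.get_filename(), part.get_payload(decode=True)
    return None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/callback":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        upload = extract_upload(
            self.headers.get("Content-Type", ""), self.rfile.read(length)
        )
        if upload is None:
            self.send_error(400, "No 'file' part in request")
            return
        filename, data = upload
        SAVE_DIR.mkdir(exist_ok=True)
        # Keep only the base name so a crafted filename cannot escape SAVE_DIR.
        target = SAVE_DIR / Path(filename or "converted.docx").name
        target.write_bytes(data)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"message": "File received successfully"}')


def serve(port=5000):
    """Listen for callbacks on the given port until interrupted."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the receiver on port 5000. When the converter runs inside Docker, `localhost` in a callback URL refers to the container itself, so a receiver running on the host may need a different address (for example `host.docker.internal` on Docker Desktop).
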
## Configuration Details

### Environment Variables

The application uses several environment variables to configure its behavior:

- `FLASK_ENV`: Sets the environment for the Flask application. It is set to `development` here, which enables debug features.
- `FLASK_APP`: Points to the entry file of the Flask application. It is set to `app.py`.
- `TESSDATA_PREFIX`: Specifies the directory where the Tesseract OCR data is stored. OCR cannot run without it.

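
Since OCR depends on `TESSDATA_PREFIX`, a startup check can surface a misconfigured path early. The sketch below is a hypothetical helper, not code from this project. It assumes `TESSDATA_PREFIX` points directly at the folder containing the `.traineddata` language files, which may differ for older Tesseract versions.

```python
import os
from pathlib import Path


def installed_languages(env=None):
    """Return the language codes found as *.traineddata under TESSDATA_PREFIX.

    Raises RuntimeError if the variable is unset, the directory is missing,
    or no language files are present.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    directory = Path(prefix)
    if not directory.is_dir():
        raise RuntimeError(f"TESSDATA_PREFIX is not a directory: {directory}")
    # Each language model is a file such as eng.traineddata.
    languages = sorted(p.stem for p in directory.glob("*.traineddata"))
    if not languages:
        raise RuntimeError(f"No .traineddata files found in {directory}")
    return languages
```

Calling `installed_languages()` once at startup and logging the result would make a missing or empty data directory visible before the first upload fails.
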
### Port Configuration

The Flask application listens on port `5000` inside the container, and Docker publishes it on port `4000` of the host machine. This mapping is defined in the `docker-compose.yml` file and makes the application accessible at `http://localhost:4000`.

### Volume Management

- **Uploads Folder**: The `./uploads` folder on the host is mapped to `/app/uploads` inside the Docker container. Uploaded PDF files are stored here temporarily during processing.
- **Outputs Folder**: The `./outputs` folder on the host is mapped to `/app/outputs` inside the container. It stores the converted Word documents, making them accessible outside the container.
- **Logs Folder**: The `./logs` folder stores the log files generated by the application, which record its operations and any errors.

This setup ensures that data persists across container restarts and that both inputs and outputs are easy to reach from the host.