python-pdf2txt

python-pdf2txt is a Dockerized Python application that converts PDF files into editable Word documents using OCR. It uses Flask to expose a web service that accepts PDF uploads over HTTP and returns the converted DOCX files.

Features

  • PDF to Word Conversion: Transforms PDF documents into DOCX format using Tesseract OCR.
  • Dockerized Application: Runs in a container, so deployment is repeatable and the runtime is the same across environments.
  • REST API: Simple API for straightforward integration, supporting PDF uploads and DOCX retrievals.

Getting Started

Step 1: Clone the Repository

Clone the repository to your local machine to get started:

git clone https://github.com/your-username/python-pdf2txt.git
cd python-pdf2txt

Step 2: Build and Run the Docker Container

Use Docker Compose to build and run your container:

docker-compose up --build -d

This command builds the Docker image (the --build flag rebuilds it even if an image already exists) and starts the container in detached mode. The service will then be available at http://localhost:4000.

Step 3: Convert a PDF to Word

Convert a PDF to a Word document by executing the following curl command:

curl -X POST -F "file=@path_to_your_pdf_file.pdf" http://localhost:4000/upload-pdf --output converted.docx

Make sure to replace path_to_your_pdf_file.pdf with the actual path to the PDF you intend to convert. The output will be saved as converted.docx.
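
The same upload can be scripted from Python. The following is a minimal sketch using only the standard library, assuming the endpoint accepts a multipart form field named file (as in the curl command) and returns the DOCX bytes in the response body. The helper names build_multipart and convert_pdf are illustrative and not part of this repository.

```python
from __future__ import annotations

import uuid
import urllib.request
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(
    pdf_path: str,
    out_path: str = "converted.docx",
    url: str = "http://localhost:4000/upload-pdf",
) -> Path:
    """POST the PDF to the service and save the response body to out_path."""
    pdf = Path(pdf_path)
    # "file" matches the form field used by the curl example.
    body, content_type = build_multipart("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        out = Path(out_path)
        out.write_bytes(response.read())
    return out
```

With the container running, convert_pdf("path_to_your_pdf_file.pdf") should write converted.docx to the current directory, mirroring the curl call above.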

Step 4: View Application Logs

To follow the application's activity in real time, view the logs:

tail -f ./logs/*

This command follows every file in the logs folder and prints new entries as they are written.
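
Where tail is not available, a rough Python equivalent is possible. This is only a sketch: the function names are hypothetical, the log files are assumed to be UTF-8 text, and unlike tail -f it prints each file from the beginning rather than from the last few lines.

```python
import time
from pathlib import Path


def read_new_lines(log_path, offset=0):
    """Return (lines appended since byte `offset`, new byte offset)."""
    with Path(log_path).open("rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Simplification: a line that is only half written is returned as-is.
    lines = chunk.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(chunk)


def follow(log_dir="./logs", interval=1.0):
    """Poll every file in log_dir and print new lines until interrupted."""
    offsets = {}
    while True:
        for path in sorted(Path(log_dir).glob("*")):
            if not path.is_file():
                continue
            lines, offsets[path] = read_new_lines(path, offsets.get(path, 0))
            for line in lines:
                print(f"{path.name}: {line}")
        time.sleep(interval)
```

Calling follow() from the project root polls ./logs once per second; stop it with Ctrl+C.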

Configuration Details

Environment Variables

The application uses several environment variables to configure its behavior:

  • FLASK_ENV: Sets the environment for the Flask application. Here it is set to development, which enables debug features.
  • FLASK_APP: Points to the entry file of the Flask application. Here, it's set to app.py.
  • TESSDATA_PREFIX: Specifies the directory where the Tesseract OCR data is stored, which is crucial for OCR functionality.
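
To check these variables from Python, a sketch along the following lines could be used. The fallback values for FLASK_ENV and FLASK_APP are the ones named above. Whether app.py applies the same defaults is not stated here, so treat load_config and missing_variables as hypothetical helpers rather than the application's own code.

```python
import os


def load_config(env=None):
    """Collect the three documented variables from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "FLASK_ENV": env.get("FLASK_ENV", "development"),
        "FLASK_APP": env.get("FLASK_APP", "app.py"),
        # No default is assumed: the tessdata path inside the container
        # is not given in this README.
        "TESSDATA_PREFIX": env.get("TESSDATA_PREFIX"),
    }


def missing_variables(config):
    """Return the names of variables that are unset or empty."""
    return [name for name, value in config.items() if not value]
```

For example, missing_variables(load_config()) returns an empty list when TESSDATA_PREFIX is set, which makes it a quick sanity check before starting OCR work.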

Port Configuration

The Docker container is configured to expose the Flask application on port 4000 of the host machine, mapping it to port 5000 inside the container. This mapping is defined in the docker-compose.yml file, allowing the application to be accessible via http://localhost:4000.
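
To confirm that the mapped port is reachable from the host, a small TCP probe can help. This is a sketch with a hypothetical helper name. It only checks that something accepts connections on the port, not that the Flask application behind it is healthy.

```python
import socket


def is_port_open(host="localhost", port=4000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

After docker-compose up, is_port_open() should return True once the container is listening; a False result suggests that the container is not running or that the port mapping differs from the one described above.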

Volume Management

  • Uploads Folder: The ./uploads folder on the host is mapped to /app/uploads inside the Docker container. This is where uploaded PDF files are stored temporarily during processing.
  • Outputs Folder: Similarly, the ./outputs folder on the host is mapped to /app/outputs inside the container. This folder stores the converted Word documents, making them accessible outside the container.
  • Logs Folder: The ./logs folder is used to store log files generated by the application, providing insights into its operations and any errors.

This setup ensures that data persists across container restarts and is easily accessible for both inputs and outputs.
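
It can be convenient to create the host folders before the first run, because Docker typically creates missing bind-mount directories itself, often owned by root. The sketch below does that and lists converted files. The helper names are hypothetical, and the assumption that converted documents appear in ./outputs with a .docx extension is inferred from the description above.

```python
from pathlib import Path

# Host-side folders named in the volume mappings above.
HOST_FOLDERS = ("uploads", "outputs", "logs")


def ensure_host_folders(project_root="."):
    """Create the uploads, outputs and logs folders if they do not exist."""
    root = Path(project_root)
    folders = []
    for name in HOST_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders


def list_converted(project_root="."):
    """Return the sorted names of .docx files in the outputs folder."""
    outputs = Path(project_root) / "outputs"
    return sorted(path.name for path in outputs.glob("*.docx"))
```

Run ensure_host_folders() from the project root before starting the container, and list_converted() afterwards to see which documents have been written to the host.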