# check_pa

[![Python tests](https://git.ccc-p.org/cyroxx/check_pa/actions/workflows/python-tests.yml/badge.svg?branch=main)](https://git.ccc-p.org/cyroxx/check_pa/actions/workflows/python-tests.yml)

Monitor the Berlin Perso/Passport portal, crack the audio CAPTCHA with Whisper, and notify a webhook when fresh status information drops.

## Features

- Automates Firefox with Selenium to reach the Berlin appointment status page
- Downloads and transcribes audio CAPTCHAs via Whisper, falling back between attempts
- Normalizes the returned status text and emits structured data to any HTTP webhook
- Includes tooling for collecting captcha samples and benchmarking transcription quality

## Requirements

- Python 3.12
- Firefox + `geckodriver` in `$PATH` for Selenium
- `ffmpeg` (needed by `openai-whisper`)
- Optional: Tesseract OCR if you experiment with the image-based approach in `ocr/`
- Optional: Python packages from `requirements-ocr.txt` when working on the OCR experiments

## Setup

1. Clone the repo and create a virtual environment: `python -m venv .venv && source .venv/bin/activate`
2. Install runtime dependencies: `pip install -r requirements.txt`
3. (Optional) Install OCR extras: `pip install -r requirements-ocr.txt`
4. (Optional) Add tooling such as pytest: `pip install -r dev-requirements.txt`
5. Provide credentials:
   - Copy `settings.example.py` to `settings.py`
   - Set `DOCUMENT_ID` (the identifier embedded in the Berlin status URL)
   - Set `WEBHOOK_URL` pointing to the service that should receive status payloads

## Usage

Run `python main.py` to start a polling cycle. The script will:

1. Launch Firefox (set `USE_HEADLESS_MODE = True` in `main.py` for CI/servers)
2. Download the audio CAPTCHA into `audio_captchas/`
3. Transcribe it with Whisper via `transcription.py`
4. Parse the resulting status page and post `{status, last_updated}` to your webhook

Helpful utilities:

- `test_transcription()` inside `main.py` evaluates every mp3 in `audio_captchas/` and writes `transcription_results.csv`
- `test_parse_status_page()` parses the fixtures in `sample_html/` to validate the BeautifulSoup logic
- `ocr/recognize*.py` contains earlier OCR experiments for the visual CAPTCHA

## Testing

- Install dev tooling: `pip install -r dev-requirements.txt`
- Run `pytest`

## Data & Artifacts

- `audio_captchas/` collects downloaded mp3 files for debugging/benchmarking
- `captchas/` and `ocr/` scripts help capture and label the image CAPTCHAs
- `sample_html/` and `samples/` host anonymized HTML snapshots used for parsing tests

## Troubleshooting

- Whisper may need the `ffmpeg` binary; ensure `ffmpeg -version` works inside the venv
- If Selenium cannot start, verify `geckodriver --version` is available and matches your Firefox version
- For webhook issues, run a tool like `nc -l 8080` or `smee.io` to inspect outbound payloads
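
If `nc` or `smee.io` is not at hand, a small local receiver can serve the same purpose for inspecting webhook payloads. The following is a minimal sketch using only the Python standard library, not part of this repository. It assumes the payload is sent as a JSON body via POST with `status` and `last_updated` keys; the README only states `{status, last_updated}`, so the encoding is an assumption. The names `WebhookHandler`, `start_receiver`, `post_sample`, and `received` are hypothetical helpers introduced for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Payloads seen so far (hypothetical helper state, for inspection only).
received = []


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POST requests and record their bodies."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            # Assumption did not hold: keep the raw text so it can still be inspected.
            payload = {"raw": body.decode("utf-8", errors="replace")}
        received.append(payload)
        print("received:", payload)
        self.send_response(204)  # No Content: acknowledge without a body
        self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request access log.
        pass


def start_receiver(port=0):
    """Start the receiver in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_sample(url, status, last_updated):
    """Send a sample payload in the assumed JSON shape and return the HTTP status."""
    data = json.dumps({"status": status, "last_updated": last_updated}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured system proxy, since the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return response.status
```

One way to use it: in an interactive session (`python -i`), call `start_receiver(8080)`, set `WEBHOOK_URL = "http://127.0.0.1:8080/"` in `settings.py`, and run `python main.py` in a second terminal. `post_sample` is only there to confirm the receiver works before pointing the real script at it.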