Written by Boris Kl
"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."

A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show: TelegramRetryAfter, then asyncio.TimeoutError, then sqlite3.OperationalError: database is locked, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened...
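As a rough illustration of the "find the first failure" rule quoted above, here is a minimal sketch that scans log lines and stops at the earliest one that looks like a failure. The pattern, the function name `first_failure`, and the sample log lines are all assumptions made up for this example; they are not taken from the bot's real logs, and a real log format would likely need a different pattern.

```python
import re

# Assumption: a line counts as a failure if it mentions "Traceback",
# a name ending in Error/Exception, or TelegramRetryAfter.
FAILURE_PATTERN = re.compile(r"Traceback|\w+(?:Error|Exception)\b|TelegramRetryAfter")


def first_failure(lines):
    """Return (line_number, line) for the earliest failure line, or None."""
    for number, line in enumerate(lines, start=1):
        if FAILURE_PATTERN.search(line):
            # Stop here: later errors are likely reactions to this one.
            return number, line.rstrip("\n")
    return None


# Hypothetical log excerpt, invented for illustration only.
sample_log = [
    "12:00:01 INFO polling started",
    "12:00:07 WARNING TelegramRetryAfter: flood control exceeded",
    "12:00:37 ERROR asyncio.TimeoutError",
    "12:00:38 ERROR sqlite3.OperationalError: database is locked",
]

result = first_failure(sample_log)
print(result)
```

The point of returning on the first match rather than collecting every match is the same as the rule itself: the later lines are deliberately ignored until the earliest failure is understood.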