Server timeouts caused orphaned fetchActivityData calls to fire clearCookieCache()
asynchronously, destroying cookies for all concurrent callers. Three fixes:
1. Replace Promise.race timeout with AbortController to properly cancel
orphaned fetches and prevent delayed clearCookieCache() calls
2. Add cookie backup/restore: backupCookies() before clearCookieCache(),
restoreCookieBackup() if re-login fails, so cookies are never lost
3. Add 15s auth failure throttle to block thundering herd re-logins when
server slowdowns generate many 500 errors simultaneously
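A minimal sketch of fix 1, assuming the request function accepts an AbortSignal. The names withTimeout and fakeFetch are hypothetical and are not taken from the codebase; fakeFetch only stands in for the real fetchActivityData call. Unlike Promise.race, which leaves the losing promise running, aborting the signal lets the task stop before it reaches any cleanup such as clearCookieCache().

```typescript
// Hypothetical helper: runs a task with a deadline and aborts it on expiry.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    // Always clear the timer so a finished task cannot be aborted later.
    clearTimeout(timer);
  }
}

// Stand-in for a real request: resolves after delayMs unless aborted first.
function fakeFetch(delayMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("ok"), delayMs);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(new Error("aborted"));
      },
      { once: true },
    );
  });
}
```

With a real HTTP client, the same signal would be passed through to the request so that the underlying connection is cancelled as well.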
KEY INSIGHT:
- 500 = cookie expiration (early signal, re-login immediately)
- 502/503/504 = real server outage (bad gateway, service unavailable, gateway timeout)
BEHAVIOR:
- On 500: throw AuthenticationError → immediate re-login
- On 502/503/504: preserve cache, don't re-login (server is down)
- On 401/403: throw AuthenticationError → re-login
This prevents unnecessary re-login attempts during actual server outages
while still handling cookie expiration immediately.
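The status handling described above could be sketched as a single classification function. The name classifyStatus and the FailureAction type are hypothetical, and the "none" fallback for all other status codes is an assumption, since the message does not say how they are handled.

```typescript
// Hypothetical action names; the real code throws AuthenticationError instead.
type FailureAction = "relogin" | "preserve-cache" | "none";

function classifyStatus(status: number): FailureAction {
  // 401/403 are explicit auth failures; 500 is treated as an early
  // sign of cookie expiration, so all three trigger a re-login.
  if (status === 401 || status === 403 || status === 500) {
    return "relogin";
  }
  // 502/503/504 indicate a real outage: keep the cache, do not re-login.
  if (status === 502 || status === 503 || status === 504) {
    return "preserve-cache";
  }
  // Assumption: any other status needs no special handling here.
  return "none";
}
```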
KEY DISCOVERY: 5xx errors are early signs of cookie expiration.
The backend returns 500 when the cookie has expired but the session is not yet
invalidated. It takes several hours before it starts returning 401/403.
CHANGES:
1. On 5xx: throw AuthenticationError to trigger immediate re-login
2. Removed cookie validation logic (no longer needed)
3. Cache still preserved during re-login process
4. Re-login happens within same request, not on next request
This fixes the issue where expired cookies would cause 5xx errors
for hours before any re-login attempt was made.
Cookie validation improvements:
1. Validate with activity ID 3350 (more reliable test endpoint)
2. Distinguish 5xx (outage) from 4xx (invalid cookie) during validation
3. On 5xx during validation: preserve cookie, don't re-login (server outage)
4. On 401/403 during validation: clear cookie and re-login
5. On network error during validation: preserve cookie (treat as server issue)
This prevents unnecessary re-logins during server outages.
Critical fixes:
1. getActivityDetailsRaw never throws on 5xx - returns null immediately
2. Cache-manager preserves existing data when fetch returns null
3. After 5xx error, validate cookie on next request (backend may invalidate sessions)
4. Cookie validation: fetch activity ID 1 as a test, re-login if it fails
This prevents local cache corruption during server outages.
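Fixes 1 and 2 might look roughly like the following, assuming an in-memory cache keyed by activity ID. The names cache and refreshActivity, and the string payload type, are hypothetical; the real cache-manager may be backed by Redis and store a different shape.

```typescript
// Hypothetical in-memory cache keyed by activity ID.
const cache = new Map<number, string>();

// fetchRaw mirrors the described contract: it returns null on a 5xx
// response instead of throwing.
async function refreshActivity(
  id: number,
  fetchRaw: (id: number) => Promise<string | null>,
): Promise<string | undefined> {
  const fresh = await fetchRaw(id);
  if (fresh !== null) {
    // Only overwrite when the fetch actually produced data.
    cache.set(id, fresh);
  }
  // On null, the previously cached value (if any) is returned untouched.
  return cache.get(id);
}
```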
- Reduce default CONCURRENT_API_CALLS from 10 to 5 (Sharp AVIF is CPU-intensive)
- Create fresh p-limit instance per batch instead of module singleton
- Add garbage collection hint between batches
- Fix skippedCount tracking (was never incremented)
- Increase batch delay from 100ms to 500ms for event loop drainage
- Process activities in batches of 100 instead of creating all 5001 promises upfront
- Clear promise array after each batch to free memory (85MB→15MB peak)
- Reduce API timeout from 20s to 10s and retries from 3 to 2
- Total time per failed request: 63s→23s (63% faster failure)
- Expected total scan time: 8.5h→1.5h (82% faster)
- Add mutex to cron jobs to prevent overlapping runs
- Replace Promise.all with batched processing (50/batch) in updateStaleClubs
- Configure HTTP connection pooling with keep-alive (maxSockets: 50)
- Add memory monitoring to scan progress logs
- Reduce CONCURRENT_API_CALLS from 8 to 5 to reduce Sharp memory pressure
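The batching approach in the changes above can be sketched as follows. The helper names chunk and processInBatches are hypothetical, and the worker callback stands in for whatever per-activity work the scan performs. Only one batch of promises exists at a time, which is what keeps peak memory bounded.

```typescript
// Split an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process items batch by batch, pausing between batches so the
// event loop can drain and finished promises can be collected.
async function processInBatches<T>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<void>,
  delayMs: number,
): Promise<void> {
  for (const batch of chunk(items, batchSize)) {
    // Assumption: the worker catches its own errors; with Promise.all,
    // an uncaught rejection would stop the whole run.
    await Promise.all(batch.map(worker));
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
}
```

For example, processInBatches(ids, 100, worker, 500) would match the batch size and delay named in the messages above.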
Root cause: Promise.all() waits for ALL promises, so a single hung/slow request
blocks the entire batch. With 5001 promises and a concurrency limit of 16, timeouts
cause cascading delays that appear as 'scan stopped'.
Fix:
- Extract processSingleActivity() helper function
- Use Promise.allSettled() instead of Promise.all()
- Each promise handles its own success/error counting
- Prevents single hung promise from blocking entire scan
Impact: Scan should now complete all 5001 IDs without getting stuck
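A rough sketch of the fix, under stated assumptions: processSingleActivity is named in the message, but its signature here is a guess, and ScanCounts, scanBatch and the fetchOne callback are hypothetical.

```typescript
interface ScanCounts {
  success: number;
  error: number;
}

// Each activity records its own outcome, so a failure never propagates.
async function processSingleActivity(
  id: number,
  fetchOne: (id: number) => Promise<unknown>,
  counts: ScanCounts,
): Promise<void> {
  try {
    await fetchOne(id);
    counts.success += 1;
  } catch {
    counts.error += 1;
  }
}

async function scanBatch(
  ids: number[],
  fetchOne: (id: number) => Promise<unknown>,
): Promise<ScanCounts> {
  const counts: ScanCounts = { success: 0, error: 0 };
  // allSettled never rejects, so one failed activity cannot end the batch early.
  await Promise.allSettled(
    ids.map((id) => processSingleActivity(id, fetchOne, counts)),
  );
  return counts;
}
```

Note that Promise.allSettled() still waits for every promise to settle, so it is the per-request timeout that bounds a hung request; allSettled only ensures that one rejection does not discard the rest of the batch.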
Before: Pre-validate cookie before every request (2-4 API calls per activity)
After: Direct request, only validate on 4xx error (1-2 API calls per activity)
Changes:
- Remove pre-validation step in fetchActivityData
- Keep existing 4xx error handling with re-login logic
- Add debug log to track cookie usage
Impact: ~20-30% reduction in API calls for normal scenarios
Benefit: Faster scanning, less load on engage API
- Fix Redis SCAN cursor type conversion (Buffer to String) to prevent early termination
- Add progress logging in initializeClubCache (every 100 activities with summary)
- Add Redis memory limits (512MB with LRU eviction policy)
- Implement cache TTL: 24h for normal data, 1h for error states (allows retry)
- Fix Docker permission issue by running app container as root
- Add TTL configuration to .env and example.env
Root cause: SCAN cursor comparison failed due to type mismatch (Buffer vs String)
Impact: Scanning now processes all 5000+ IDs instead of stopping at ~300
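The cursor fix can be illustrated with a client-agnostic loop. This is a sketch, not the project's code: scanAllKeys and the ScanReply type are hypothetical, and the scan callback stands in for whichever Redis client call the project uses. The key point is that the cursor is converted to a string before it is compared with "0", so a Buffer-like cursor cannot break the termination check.

```typescript
// The cursor may arrive as a string or a Buffer-like object, so it is
// typed only as "something with toString()".
type ScanReply = [{ toString(): string }, string[]];

async function scanAllKeys(
  scan: (cursor: string) => Promise<ScanReply>,
): Promise<string[]> {
  const keys: string[] = [];
  let cursor = "0";
  do {
    const [rawCursor, batch] = await scan(cursor);
    keys.push(...batch);
    // Normalize before comparing: a Buffer is never === "0".
    cursor = String(rawCursor);
  } while (cursor !== "0");
  return keys;
}
```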