BigQuery Data Platform - Pending Work
Status: Core infrastructure complete, data syncs and personalization pending Last Updated: February 2026
What’s Been Implemented
The BigQuery data warehouse is operational with the core infrastructure in place:
| Component | Status | Notes |
|---|---|---|
| GCP Project | Done | arlem-analytics-warehouse, australia-southeast1 |
| BigQuery Schema | Done | 8 tables: visitors, sessions, events, shopifyorders, metaadsspend, klaviyoprofiles, instagramposts, gscqueries |
| Cosmos to BigQuery Sync | Done | Hourly sync, 5,846 visitors/8,158 sessions/16,971 events migrated |
| Shopify Orders Sync | Done | 49 orders synced with product categorisation |
| Meta Ads Sync | Done | 286 rows (Feb 2023 - Jul 2024, $1,342.65 spend) |
| Analytics Views | Done | orderattribution, campaignroas, weeklyperformance, sourcesummary, channel_influence |
| BigQuery MCP Tools | Done | 5 tools in arlem-analytics MCP server |
| Visitor Segment Hook | Done | useVisitorSegment() returns Cold/Interested/Warm/Hot/Customer |
| Segment Reverse Sync | Done | BigQuery to Cosmos to API response to localStorage |
Data Flow:
Website Events --> Cosmos DB --> BigQuery (hourly)
|
External APIs (Shopify, Meta) ------+
|
BigQuery --> Analytics Views --> MCP Tools --> Claude Code
--> Cosmos --> API --> localStorage --> useVisitorSegment()Pending Work
1. Data Sync: Klaviyo Profiles
Priority: Medium Effort: 4-6 hours
Sync Klaviyo profiles to BigQuery for email-based identity resolution.
Why it matters:
- Link email subscribers to visitor journeys
- Enable email-based retargeting
- Currently only 0.1% of visitors have email captured
Implementation:
- Create
scheduledKlaviyoSync.ts(daily timer trigger) - Use existing Klaviyo MCP API patterns
- Sync profiles updated since last sync
- Link to visitor_id via email match
- Insert to
klaviyo_profilestable
Table schema (already exists):
klaviyo_profiles (profile_id, email, first_name, last_name, created, subscribed, lists, visitor_id, synced_at)2. Data Sync: Instagram Posts
Priority: Low Effort: 4-6 hours
Sync Instagram posts with insights to BigQuery.
Why it matters:
- Track which posts drive traffic and sales
- Currently can’t answer “Which Instagram posts lead to revenue?”
Implementation:
- Create
scheduledInstagramSync.ts(daily timer trigger) - Use existing Instagram MCP API patterns
- Sync posts with engagement metrics
- Insert to
instagram_poststable
Table schema (already exists):
instagram_posts (post_id, posted_at, caption, media_type, permalink, impressions, reach, engagement, saves, synced_at)3. Data Sync: Google Search Console
Priority: Medium Effort: 2-4 hours
Current state: Table exists but has 0 rows. Pipeline broken or never enabled.
Why it matters:
- Track search performance and SEO health
- Required for content gap analysis
Implementation:
- Create
scheduledGSCSync.ts(daily timer trigger) - Use existing GSC MCP API patterns
- Note: GSC data has 3-day delay
- Insert to
gsc_queriestable
Table schema (already exists):
gsc_queries (date, query, page, clicks, impressions, ctr, position, synced_at)4. Personalization: Segment-Aware Components
Priority: High Effort: 1-2 days
The useVisitorSegment() hook is working but no components use it yet.
Implementation:
- Create
ForSegmentwrapper component - Create
ForPropensitywrapper component - Implement segment-specific UI:
- Cold: Brand story banner, trust badges - Interested: “Compare styles” prompt, size guide visibility - Warm: Customer reviews, “Still deciding?” message - Hot: Cart reminder bar, urgency messaging - Customer: Welcome back, cross-sell suggestions
Example usage (from spec):
<ForSegment segments={['Warm', 'Hot', 'Customer']}>
<CustomerReviews product={product} />
</ForSegment>5. Personalization: Propensity Scoring
Priority: High Effort: 1-2 days
Add propensity score alongside journey segment for finer-grained personalization.
The validated scoring model:
propensity_score (0-16) =
Gallery swipes: (20+: +5, 10-19: +3, 5-9: +2, 1-4: +1)
Sessions: (3+: +3, 2: +2)
Products viewed: (3+: +2, 2+: +1)
AU/NZ location: +1
High-intent source (IG Click, Google Shopping): +2Key thresholds:
| Score | Precision | Recall | Use Case |
|---|---|---|---|
| >= 5 | 16% | 51.5% | Standard personalization |
| >= 8 | 34% | 27% | Discount offers |
| >= 10 | 54% | 15% | VIP treatment |
Implementation:
- Compute propensity_score in BigQuery (already in schema)
- Return score in
/api/pingresponse - Store in localStorage as
_ps - Create
usePropensityScore()hook - Trigger offers based on thresholds
6. Tracking Gaps to Fix
Priority: Critical (for attribution accuracy)
| Gap | Impact | Fix |
|---|---|---|
| Order-to-visitor linkage | 78% of purchases unattributed | Store visitor_id in Shopify order notes via cart attributes |
| Click ID persistence | fbclid/gclid lost at checkout | Server-side tracking (Meta CAPI, Google Enhanced) |
| Email capture rate | 0.1% of visitors | Exit-intent popup, “Save cart” feature |
| Checkout events | No visibility between cart and purchase | Add checkoutstarted, checkoutcompleted events |
7. Validation Tasks
Priority: Medium Effort: 2-4 hours
- Validate Cosmos to BigQuery data integrity
- Validate Shopify order counts (49 synced vs actual)
- Spot-check 10 random visitor journeys
- Compare BigQuery ROAS with Meta dashboard
- Document expected differences
- Performance test sync functions (< 15 min)
- Test segment API response time (< 500ms)
- Review and adjust journey stage thresholds
Key Insights from Analysis
These findings from the initial BigQuery analysis inform the pending work:
| Finding | Implication |
|---|---|
| Instagram Click converts 7.5x better than Direct | Shift ad spend to Instagram Click |
| Gallery swipes 20+ = 50%+ cart rate | Surface gallery earlier, trigger CTAs after swipes |
| Sunday converts 2x vs Tuesday | Schedule campaigns for Sunday |
| 517 Warm visitors stuck | High-intent, need personalization nudge |
| 86% cart abandonment | Above average but recoverable; address shipping/size uncertainty |
| TikTok: 355 visitors, 0 carts | Pause or fix targeting |
| Retargeting: 2% of spend, 0 carts | Increase budget to 30%, segment properly |
Files Reference
The original implementation is documented in:
analytics/data/constants.ts- BigQuery configanalytics/types/bigquery.ts- Row typesanalytics/repositories/bigquery.ts- Clientanalytics/repositories/bigqueryTransformers.ts- Data transformsanalytics/src/functions/scheduledBigQuerySync.ts- Hourly syncanalytics/src/functions/scheduledShopifySync.ts- Shopify ordersanalytics/src/functions/scheduledMetaAdsSync.ts- Meta Ads spendanalytics/sql/views/*.sql- Analytics viewsmcp/src/arlem-analytics/tools/bigquery.ts- MCP toolsweb/lib/hooks/use-visitor-segment.ts- Segment hook
Cost
Current monthly cost: ~$5-10 (BigQuery streaming inserts are the main variable)
BigQuery storage is 12x cheaper than Cosmos DB, so adding capability actually reduced overall costs.