跳到主要內容

BigQuery - Load JSON file to BigQuery

使用BigQuery的第一步,原則上就是開始把資料丟上BigQuery... 在Google Cloud Platform中有許多產品已經支援直接upload到BigQuery中,可以參考官方網站的說明... 這邊介紹的是如何將現成的JSON data匯入到BigQuery中(因為CSV相對簡單許多,這邊不贅述CSV的載入),使用到的工具是gsutil,建議可使用Cloud Platform整合版本的(這邊可以下載)

Sample Schema (1 record only)


BigQuery原則上吃csv與json兩種格式,其中json格式與標準的json有些許不同,他強調的是NEWLINE_DELIMITED_JSON,也就是說,每筆row之間是使用斷行隔開,而且每筆row就是獨立的一筆json資料欄... 但有別於csv,json格式的檔案是允許內部再有子階層的資料型態,以本範例為例,items為新的一個json array,可想像成RDBMS中的detail table資料...

{
 "transaction_time": "2013-10-20T14:32:22",
 "transaction_id": 50124,
 "total_spending": 300,
 "coupon_discount": 20,
 "items": [
   {
     "name": "xt920",
     "category": "mobile",
     "quantity": 10,
     "spending": 320
   },
   {
     "name": "xt920b",
     "category": "battery",
     "quantity": 11,
     "spending": 20
   },
   {
     "name": "xt921",
     "category": "mobile",
     "quantity": 10,
     "spending": 200
   }
 ]
}

上面資料是經過排版的結果,實際資料長相如下:

{"transaction_time":"2013-10-20T14:32:22","transaction_id":50124,"total_spending":300,"coupon_discount":20,"items":[{"name":"xt920","category":"mobile","quantity":10,"spending":320},{"name":"xt920b","category":"battery","quantity":11,"spending":20},{"name":"xt921","category":"mobile","quantity":10,"spending":200}]}
...(next data)

bq指令 - for JSON data

bq load --source_format NEWLINE_DELIMITED_JSON [target dataset].[target table] [json data] [schema data]

基本的bq操作指令如上面所示,因為預設是load CSV,所以在load json時候需要多加上"--source_format NEWLINE_DELIMITED_JSON"的提示,然後依序再帶入json data file與schema file讓bq可以載入檔案。


JSON Data Schema Definition
不論是CSV或是JSON都需要針對傳入的資料做欄位定義的動作,欄位的定義會方便BigQuery判斷查詢與製作索引,以JSON資料而言,定義檔也是json format,但是是json array,array中帶入每筆欄位的name(名稱), mode(是否可以為空、是否為集合型態等), type等資訊,其中mode為集合型態的話,則會多個欄位叫field的來收容階層資料資訊...
而載入資料若錯誤,系統會提示錯誤的內容,可再依照錯誤內容做修改... 以上面Sample Data為例,資料具有階層式的特性,下面先展示兩個錯誤的schema定義,在載入時候的錯誤提示:

Error1: Load File with “non-repeated field” error


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"required","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"required","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r3d39c5449048c912_00000141df277e2c_1 ... (25s) Current status: DONE
BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r3d39c5449048c912_00000141df277e2c_1': Too many errors encountered. Limit is: 0.
Failure details:
- array specified for non-repeated field

Error2: Load File with “missing required field(s)” error


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"repeated","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"required","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r128c022a9d7e9a45_00000141df299024_1 ... (16s) Current status: DONE
BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r128c022a9d7e9a45_00000141df299024_1': Too many errors encountered. Limit is: 0.
Failure details:
- missing required field(s)

在修改items欄位的定義之後,正確載入json的格式如下:

Finish Load JSON Data


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"repeated","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"nullable","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r2110b4bab734de17_00000141df2a4271_1 ... (69s) Current status: DONE

而正確的載入資料之後,就可以使用BigQuery的Portal來做資料的查詢,而階層式的資料查詢在BigQuery中會自動的做flatten的動作,因此呈現的資料表仍然是1個column,但是資料欄位名稱會以"."的方式將主欄位與階層欄位呈現出來,如下所示:



查詢的當下,也可以此方式指定條件或group by等等的方式。

留言

這個網誌中的熱門文章

Google指令碼基本操作介紹 - Web Server篇

Google的指令碼是什麼東西呢?!原則上他就是Google的一份靜態檔案,但是透過Google的雲端服務平台的一些能力,將靜態檔案內的scriptlet片段拉到Google的後端作運算,寫起來就像在寫JavaScript(這邊說Node.js可能比較貼切,因為同為server side language)或JSP,而在scriptlet片段中,則可以操作許多Google的API服務,甚至他提供你連接JDBC的能力、URL呼叫的能力...等,宛如就是一套完整的雲端程式語言(這樣說應該不為過拉,這真是個創新!),有並駕於App Engine的氣勢喔!
Google指令碼的範圍很廣,筆者也仍在摸索中,之前介紹過透過Sheet+指令碼做一個簡單的URL監控(這裡),而本篇簡單介紹一下指令碼如何製作一個Web Server(嚴格說起來是Web Page拉,但是具備Server端運作功能喔!)。您將可以體驗到No-Hosting Web Server的威力!
指令碼是Google Drive的一個服務,Google將指令碼(Code)以檔案方式寄存在Drive中,類似的靜態檔案服務的應用,最近滿火紅的!

首先開啟指令碼時候,選擇"作為網路應用程式的指令碼",檔案開啟後,會有愈設定程式碼片段供編輯


程式碼片段大致上如下,是一個doGet function,Web base的指令碼需要認得doGet()作為server的進入點 如果選擇到空白專案的話,只要把doGet function建上即可

作為一個Cloud IDE,Google當然也有把Code Hint擺上來,透過簡單的提示,寫啟程是來就更容易拉!

而Web部分物件的建立主要是透過HtmlService這個模組來進行操作,我們利用他來output html, load static html page, load template html page..等,範例如下:
Output HTML: // Script-as-app template.
function doGet(e) {
  return HtmlService.createHtmlOutput("<h1>HELLO!</h1>");
}
透過上HtmlService的createHtmlOutput的功能,…

透過Google指令碼,存取Google Cloud SQL

既然Cloud SQL提供了MySQL的Feature 那麼透過指令碼(Apps Script)來存取Cloud SQL應該也是OK的拉! 這邊介紹一下在指令碼中操作的方式
首先要確認一下您在Cloud SQL中申請的DB Instance名稱 而名稱可以在API Console中找到

此範例是屬於手動執行的部份,因此這邊建一個test function來收容db query的code...
function test(){
  var conn = Jdbc.getCloudSqlConnection("jdbc:google:rdbms://[db instance name]/[db name]");
  var stmt = conn.createStatement();
  stmt.setMaxRows(100);
  var start = new Date();
  var sql = "select * from member_info";
  var rs = stmt.executeQuery(sql);
  Logger.log('SQL:' + sql);
  while(rs.next()){
    Logger.log('['+new Date().toString() + ']' + rs.getString(1) + '::' + rs.getString(2) );
  }
}
Project看起來像這樣:

此時,在工具列選擇要執行的Function後,點選執行按鈕,就可以執行撈取Cloud SQL的動作了...

執行時候會發現Google會跟您要求受權,請user同意讓Script Editor來執行這部份程式,同意後才可以執行... 這部份是說,如果將來把這段程式碼放到Web Service中時候,該Web Service App也需要經過授權才能使用...

New Google Form Feature...

Google這次在Google Form上面做了一些改變...原本的Google Form背後都會連結到一個特定的表單(一個Form一個表單) 這次的修改就是針對這部份的限制做改善 目前可以透過Form編輯的頁面上"選擇回憶目的地"來選擇...



這邊是選擇的畫面,可以選擇新的試算表或是選擇現有的試算表中的新工作表...


如此一來,表單的運作將更具彈性拉!
詳細的Google說明,可以在這邊找到:https://support.google.com/drive/bin/answer.py?hl=en&answer=2917686&p=forms_response