跳到主要內容

BigQuery - Load JSON file to BigQuery

使用BigQuery的第一步,原則上就是開始把資料丟上BigQuery... 在Google Cloud Platform中有許多產品已經支援直接upload到BigQuery中,可以參考官方網站的說明... 這邊介紹的是如何將現成的JSON data匯入到BigQuery中(因為CSV相對簡單許多,這邊不贅述CSV的載入),使用到的工具是gsutil,建議可使用Cloud Platform整合版本的(這邊可以下載)

Sample Schema (1 record only)


BigQuery原則上吃csv與json兩種格式,其中json格式與標準的json有些許不同,他強調的是NEWLINE_DELIMITED_JSON,也就是說,每筆row之間是使用斷行隔開,而且每筆row就是獨立的一筆json資料欄... 但有別於csv,json格式的檔案是允許內部再有子階層的資料型態,以本範例為例,items為新的一個json array,可想像成RDBMS中的detail table資料...

{
 "transaction_time": "2013-10-20T14:32:22",
 "transaction_id": 50124,
 "total_spending": 300,
 "coupon_discount": 20,
 "items": [
   {
     "name": "xt920",
     "category": "mobile",
     "quantity": 10,
     "spending": 320
   },
   {
     "name": "xt920b",
     "category": "battery",
     "quantity": 11,
     "spending": 20
   },
   {
     "name": "xt921",
     "category": "mobile",
     "quantity": 10,
     "spending": 200
   }
 ]
}

上面資料是經過排版的結果,實際資料長相如下:

{"transaction_time":"2013-10-20T14:32:22","transaction_id":50124,"total_spending":300,"coupon_discount":20,"items":[{"name":"xt920","category":"mobile","quantity":10,"spending":320},{"name":"xt920b","category":"battery","quantity":11,"spending":20},{"name":"xt921","category":"mobile","quantity":10,"spending":200}]}
...(next data)

bq指令 - for JSON data

bq load --source_format NEWLINE_DELIMITED_JSON [target dataset].[target table] [json data] [schema data]

基本的bq操作指令如上面所示,因為預設是load CSV,所以在load json時候需要多加上"--source_format NEWLINE_DELIMITED_JSON"的提示,然後依序再帶入json data file與schema file讓bq可以載入檔案。


JSON Data Schema Definition
不論是CSV或是JSON都需要針對傳入的資料做欄位定義的動作,欄位的定義會方便BigQuery判斷查詢與製作索引,以JSON資料而言,定義檔也是json format,但是是json array,array中帶入每筆欄位的name(名稱), mode(是否可以為空、是否為集合型態等), type等資訊,其中mode為集合型態的話,則會多個欄位叫field的來收容階層資料資訊...
而載入資料若錯誤,系統會提示錯誤的內容,可再依照錯誤內容做修改... 以上面Sample Data為例,資料具有階層式的特性,下面先展示兩個錯誤的schema定義,在載入時候的錯誤提示:

Error1: Load File with “non-repeated field” error


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"required","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"required","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r3d39c5449048c912_00000141df277e2c_1 ... (25s) Current status: DONE
BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r3d39c5449048c912_00000141df277e2c_1': Too many errors encountered. Limit is: 0.
Failure details:
- array specified for non-repeated field

Error2: Load File with “missing required field(s)” error


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"repeated","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"required","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r128c022a9d7e9a45_00000141df299024_1 ... (16s) Current status: DONE
BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r128c022a9d7e9a45_00000141df299024_1': Too many errors encountered. Limit is: 0.
Failure details:
- missing required field(s)

在修改items欄位的定義之後,正確載入json的格式如下:

Finish Load JSON Data


$ cat schema.json
[
 {"name":"transaction_time","mode":"nullable","type":"string"},
 {"name":"transaction_id","mode":"required","type":"integer"},
 {"name":"total_spending","mode":"required","type":"integer"},
 {"name":"coupon_discount","mode":"required","type":"integer"},
 {"name":"items","mode":"repeated","type":"record", "fields": [
   {"name":"name","mode":"required","type":"string"},
   {"name":"category","mode":"required","type":"string"},
   {"name":"quantity","mode":"required","type":"integer"},
   {"name":"spending","mode":"required","type":"integer"},
   {"name":"descript","mode":"nullable","type":"string"}
 ]}
]

$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json
Waiting on bqjob_r2110b4bab734de17_00000141df2a4271_1 ... (69s) Current status: DONE

而正確的載入資料之後,就可以使用BigQuery的Portal來做資料的查詢,而階層式的資料查詢在BigQuery中會自動的做flatten的動作,因此呈現的資料表仍然是1個column,但是資料欄位名稱會以"."的方式將主欄位與階層欄位呈現出來,如下所示:



查詢的當下,也可以此方式指定條件或group by等等的方式。

留言

這個網誌中的熱門文章

Share a chrome plugin for manage google cloud platform

好玩意兒報報.... 同事的新作,把Google Project List在Chrome Plugin中! 對一次管理多個專案的人來說,真得超方便的拉! 下載: https://chrome.google.com/webstore/detail/gdclauncher/bicgkglnnilldakpenngnblekooejnpg 使用說明: 1. Use browser url bar to quick search: Type "gdcl" in browser Press "TAB" to start search Type the project id key word then select the search result... 2. Using quick launch bar... You can search by keyword or click project name to go to the project or gae link to go to gae or go to billing page....

透過Google Apps Script結合Google Form做即時郵件通知

體驗過Google Apps Script的功能後,也發現他結合GmailApps的模組 GmailApps的應用可以用在表單填寫完成後,做發信的通知 例如您開立了一個訂購的表單,為了要在第一時間通知商家有訂單進入 就可以直接呼叫Gmail做發信的通知,讓手持Smart Phone的我們可以很快的知道生意上門了! 下面規劃三個function,其中: onCommit():為form commit時候觸發的function,需要掛載於form commit trigger上 jsonArrToTable():目的將json array解析成為一個Table getLastRowTable():目的將整個table的回傳過濾為剩下第一筆(表頭,含有Form的欄位說明)與最後一筆(原則上就是剛剛送出的那一筆表單) 完整程式碼如下: function onCommit(){   var sheet = SpreadsheetApp.getActiveSheet();   var rows = sheet.getDataRange();   var numRows = rows.getNumRows();   var values = rows.getValues();   var content = getLastRowTable(values);   var htmlBody = "Hi Admin: <br/><br/>有訂單拉,檢查一下吧! <br/><br/>" + content + '<br/><br/>Send by Google Apps';   GmailApp.sendEmail(     " your-email-address@gmail.com ",      "Order Confirm Notice",      htmlBody,      {from: ' from-email-...

Cloud Monitor嚐鮮

GCP上,我們非常想要的一個功能,終於問世.... Cloud Monitor來了!