<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>BigQuery on R Views</title>
    <link>https://rviews.rstudio.com/tags/bigquery/</link>
    <description>Recent content in BigQuery on R Views</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 02 Feb 2018 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://rviews.rstudio.com/tags/bigquery/" rel="self" type="application/rss+xml" />
    
    
    
    
    <item>
      <title>Cost-Effective BigQuery with R</title>
      <link>https://rviews.rstudio.com/2018/02/02/cost-effective-bigquery-with-r/</link>
      <pubDate>Fri, 02 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>https://rviews.rstudio.com/2018/02/02/cost-effective-bigquery-with-r/</guid>
      <description>
           



&lt;div id=&#34;introduction&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Companies using Google BigQuery for production analytics often run into the following problem: the company has a large user hit table that spans many years. Since queries are billed based on the fields accessed, and not on the date ranges queried, every query on the table is billed for all available days and becomes more wasteful as the table grows.&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;/post/2018-01-30-BigQuery_files/partition.png&#34; /&gt;

&lt;/div&gt;
&lt;p&gt;A solution is to partition the table by date, so that users can query a particular range of dates, which saves costs and decreases query duration. Partitioning a non-partitioned table can be expensive if done the brute-force way. This article explores one cost-effective partitioning method and uses the &lt;a href=&#34;https://cran.r-project.org/package=condusco&#34;&gt;condusco&lt;/a&gt; R package to automate the query generation and partitioning steps.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;migrating-non-partitioned-tables-to-partitioned-tables-in-google-bigquery&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Migrating non-partitioned tables to partitioned tables in Google BigQuery&lt;/h2&gt;
&lt;p&gt;Let’s implement the accepted solution on StackOverflow for &lt;a href=&#34;https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables&#34;&gt;migrating from non-partitioned to partitioned tables in Google BigQuery&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The brute-force way to partition a non-partitioned table is to repeatedly query the table for anything matching a particular day and then save that data to a new table with a date suffix, e.g. _20171201.&lt;/p&gt;
&lt;p&gt;The problem is that this method costs as much as querying the full table’s worth of data, multiplied by the number of days the table is partitioned into. For a 10-terabyte table spanning three years, one SELECT * might cost $50 (BigQuery charges $5 per TB accessed). Hence, splitting the table into three years of daily partitions will cost $50 * 365 * 3 = &lt;strong&gt;$54,750&lt;/strong&gt;!&lt;/p&gt;
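&lt;p&gt;As a rough sketch of the brute-force approach, the per-day queries could be generated as shown here. The table name my_dataset.hits and the hit_date column are hypothetical placeholders, not names from the example in this article. Each generated query would scan the full table, which is where the cost comes from.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical source table and date column (assumptions for illustration)
source_table &amp;lt;- &amp;quot;my_dataset.hits&amp;quot;
date_column &amp;lt;- &amp;quot;hit_date&amp;quot;

# Three example days; a real migration would cover every day in the table
days &amp;lt;- seq(as.Date(&amp;quot;2017-12-01&amp;quot;), by = &amp;quot;day&amp;quot;, length.out = 3)

# One query and one date-suffixed destination table per day
brute_force &amp;lt;- data.frame(
  destination_table = paste0(source_table, &amp;quot;_&amp;quot;, format(days, &amp;quot;%Y%m%d&amp;quot;)),
  query = sprintf(
    &amp;quot;SELECT * FROM `%s` WHERE %s = DATE &amp;#39;%s&amp;#39;&amp;quot;,
    source_table, date_column, format(days, &amp;quot;%Y-%m-%d&amp;quot;)
  ),
  stringsAsFactors = FALSE
)

# Inspect the generated table names and queries
print(brute_force$destination_table)
print(brute_force$query)&lt;/code&gt;&lt;/pre&gt;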
&lt;p&gt;The more cost-effective &lt;a href=&#34;https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables&#34;&gt;solution&lt;/a&gt; described on StackOverflow is to ARRAY_AGG the entire table into one record for each day. This requires one query over the table’s data to ARRAY_AGG each day you are interested in, followed by one UNNEST query per day, each of which reads only a single column of the aggregated table.&lt;/p&gt;
&lt;p&gt;This solution queries the full table’s worth of data twice, instead of once per day. That’s a cost of $100, saving &lt;strong&gt;$54,650&lt;/strong&gt;.&lt;/p&gt;
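&lt;p&gt;The arithmetic behind these figures can be checked with a few lines of R. This is only a back-of-the-envelope sketch using the table size and the $5 per TB price quoted above; the variable names are illustrative.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Figures taken from the text above
table_size_tb &amp;lt;- 10   # table size in terabytes
price_per_tb &amp;lt;- 5     # USD per terabyte accessed
n_days &amp;lt;- 365 * 3     # three years of daily partitions

full_scan_cost &amp;lt;- table_size_tb * price_per_tb  # one pass over the table
brute_force_cost &amp;lt;- full_scan_cost * n_days     # one full scan per day
pivot_cost &amp;lt;- full_scan_cost * 2                # two full scans in total
savings &amp;lt;- brute_force_cost - pivot_cost

# Print the three amounts in USD
print(c(brute_force = brute_force_cost, pivot = pivot_cost, savings = savings))&lt;/code&gt;&lt;/pre&gt;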
&lt;p&gt;Here is an implementation of the solution using &lt;a href=&#34;https://github.com/ras44/condusco&#34;&gt;condusco&lt;/a&gt; to automate both the query generation and the partitioning:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(whisker)
library(bigrquery)
library(condusco)

# Set GBQ project
project &amp;lt;- &amp;#39;&amp;lt;YOUR_GBQ_PROJECT_ID_HERE&amp;gt;&amp;#39;

# Configuration
config &amp;lt;- data.frame(
  dataset_id = &amp;#39;&amp;lt;YOUR_GBQ_DATASET_ID_HERE&amp;gt;&amp;#39;,
  table_prefix = &amp;#39;tmp_test_partition&amp;#39;
)

# Set the following options for GBQ authentication on a cloud instance
options(&amp;quot;httr_oauth_cache&amp;quot; = &amp;quot;~/.httr-oauth&amp;quot;)
options(httr_oob_default=TRUE)

# Run the below query to authenticate and write credentials to .httr-oauth file
query_exec(&amp;quot;SELECT &amp;#39;foo&amp;#39; as bar&amp;quot;,project=project);

# The pipeline that creates the pivot table
migrating_to_partitioned_step_001_create_pivot &amp;lt;- function(params){
  
  destination_table &amp;lt;- &amp;quot;{{{dataset_id}}}.{{{table_prefix}}}_partitions&amp;quot;
  
  query &amp;lt;- &amp;quot;
  SELECT
    {{#date_list}}
    ARRAY_CONCAT_AGG(CASE WHEN d = &amp;#39;day{{{yyyymmdd}}}&amp;#39; THEN r END) AS day_{{{yyyymmdd}}},
    {{/date_list}}
    line
  FROM (
    SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
    FROM (
      SELECT 
        stn, CONCAT(&amp;#39;day&amp;#39;, year, mo, da) AS d, ARRAY_AGG(t) AS r
      FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t 
      GROUP BY stn, d
    ) 
  )
  GROUP BY line
  &amp;quot;
  
  query_exec(whisker.render(query,params),
             project=project,
             destination_table=whisker.render(destination_table, params),
             write_disposition=&amp;#39;WRITE_TRUNCATE&amp;#39;,
             use_legacy_sql = FALSE
  );
  
}

# Run the pipeline that creates the pivot table

# Create a JSON string in the invocation query that looks like [{&amp;quot;yyyymmdd&amp;quot;:&amp;quot;20171206&amp;quot;},{&amp;quot;yyyymmdd&amp;quot;:&amp;quot;20171205&amp;quot;},...]
invocation_query &amp;lt;- &amp;quot;
  SELECT
    &amp;#39;{{{dataset_id}}}&amp;#39; as dataset_id,
    &amp;#39;{{{table_prefix}}}&amp;#39; as table_prefix,
    CONCAT(
      &amp;#39;[&amp;#39;,
      STRING_AGG(
        CONCAT(&amp;#39;{\&amp;quot;yyyymmdd\&amp;quot;:\&amp;quot;&amp;#39;,FORMAT_DATE(&amp;#39;%Y%m%d&amp;#39;,partition_date),&amp;#39;\&amp;quot;}&amp;#39;)
      ),
      &amp;#39;]&amp;#39;
    ) as date_list
  FROM (
    SELECT
    DATE_ADD(DATE(CURRENT_DATETIME()), INTERVAL -n DAY) as partition_date
    FROM (
      SELECT [1,2,3] as n
    ),
    UNNEST(n) AS n
  )
&amp;quot;

run_pipeline_gbq(
  migrating_to_partitioned_step_001_create_pivot,
  whisker.render(invocation_query,config),
  project,
  use_legacy_sql = FALSE
)

# The pipeline that creates the individual partitions 
migrating_to_partitioned_step_002_unnest &amp;lt;- function(params){
  
  destination_table &amp;lt;- &amp;quot;{{{dataset_id}}}.{{{table_prefix}}}_{{{day_partition_date}}}&amp;quot;
  
  query &amp;lt;- &amp;quot;
    SELECT r.*
    FROM {{{dataset_id}}}.{{{table_prefix}}}_partitions, UNNEST({{{day_partition_date}}}) as r
  &amp;quot;
  
  query_exec(whisker.render(query,params),
             project=project,
             destination_table=whisker.render(destination_table, params),
             write_disposition=&amp;#39;WRITE_TRUNCATE&amp;#39;,
             use_legacy_sql = FALSE
  );
  
}

# Build one row per partition day; day_partition_date names a column of the pivot table
invocation_query &amp;lt;- &amp;quot;
  SELECT
    &amp;#39;{{{dataset_id}}}&amp;#39; as dataset_id,
    &amp;#39;{{{table_prefix}}}&amp;#39; as table_prefix,
    CONCAT(&amp;#39;day_&amp;#39;,FORMAT_DATE(&amp;#39;%Y%m%d&amp;#39;,partition_date)) as day_partition_date
  FROM (
    SELECT
      DATE_ADD(DATE(CURRENT_DATETIME()), INTERVAL -n DAY) as partition_date
    FROM (
      SELECT [1,2,3] as n
    ),
    UNNEST(n) AS n
  )
&amp;quot;

# Run the pipeline that creates the individual partitions
run_pipeline_gbq(
  migrating_to_partitioned_step_002_unnest,
  whisker.render(invocation_query,config),
  project,
  use_legacy_sql = FALSE
)&lt;/code&gt;&lt;/pre&gt;
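&lt;p&gt;To see what the first pipeline step sends to BigQuery, the query template can be rendered locally without running anything. The sketch below assumes that date_list reaches the pipeline function as a list of records, one per day, which is the shape the JSON string built by the invocation query is meant to produce. The dataset name and the two dates are placeholders.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(whisker)

# Example parameters; dataset name and dates are placeholders
params &amp;lt;- list(
  dataset_id = &amp;quot;my_dataset&amp;quot;,
  table_prefix = &amp;quot;tmp_test_partition&amp;quot;,
  date_list = list(
    list(yyyymmdd = &amp;quot;20171206&amp;quot;),
    list(yyyymmdd = &amp;quot;20171205&amp;quot;)
  )
)

# The SELECT list from the pivot query: one output column per entry in date_list
template &amp;lt;- &amp;quot;
SELECT
  {{#date_list}}
  ARRAY_CONCAT_AGG(CASE WHEN d = &amp;#39;day{{{yyyymmdd}}}&amp;#39; THEN r END) AS day_{{{yyyymmdd}}},
  {{/date_list}}
  line
&amp;quot;

# Render the template and print the resulting SQL fragment
rendered &amp;lt;- whisker.render(template, params)
cat(rendered)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The section between {{#date_list}} and {{/date_list}} is repeated once per day, so the pivot table ends up with one array column for each requested partition.&lt;/p&gt;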
&lt;/div&gt;

      </description>
    </item>
    
  </channel>
</rss>
