What is a data type in Google BigQuery?
In this article, we discover what data types are, and how they are used in Google BigQuery. We also look at commonly used data types and show some examples.
We also encounter some data types in BigQuery that are very similar.
Should we use a FLOAT
vs NUMERIC
? Is there a difference if we choose one over the other?
We also touch on common topics, such a knowing the largest or smallest number we can assign, or the longest chunk of text we can store.
Dates are tricky, though we know that your date last week was also a tricky one. In this article, we refer to calendar dates and time information in SQL. We also go over these briefly.
Finally, there are scenarios when we need to convert from one data type to another. We have examples of how to use CAST
and CONVERT
for this purpose.
What is a Data Type in BigQuery?
If you are familiar with spreadsheets such as Google Sheets or Microsoft Excel, you will notice that you can specify each column, row, or cell to follow a particular format.
For example, a number can be formatted either as a two-decimal currency, or just as a plain text.
Data types are important. If we know that a collection of data refers to test scores, we know what we can do with it. We can get the highest score, lowest, average, and so on.
When we assign a data type, we know what to expect for a value, and we know the kind of data that we are dealing with.
We can have numbers, characters, dates, fractions, you name it.
Google BigQuery supports a number of data types, namely:
- Numbers and fractions
- Characters
- Boolean (True or False)
- Bytes (raw data)
- Date and Time
- Arrays (a group of data)
- Struct (customized data)
- Geography
We go over these in the following sections.
List of All Data Types in BigQuery
We have the table below to show us the different data types in BigQuery, along with their names, descriptions, and keywords.
For example, if we are dealing with the number of children, we strictly need numbers without decimal parts. In this case, we use Integers represented by the keyword INT64
.
Numbers and Fractions
Integer (INT64 ) | Whole numbers only, including negative numbers |
Numeric (NUMERIC ) | Includes decimal numbers with up to 38 digits (including decimal values) |
Float (FLOAT64 ) | Includes decimal numbers, but with a certain loss of accuracy |
Characters
String (STRING ) | Characters that are stored and displayed in UTF-8 encoding |
Boolean
Boolean (BOOL ) | True or False only, and is case-insensitive |
Bytes
Bytes (BYTES ) | Characters that may or may not be in UTF-8. |
Date and Time
Date (DATE ) | Includes year, month, date |
Time (TIME ) | Includes time (with up to 6 decimal places for seconds) |
Datetime (DATETIME ) | Includes both date and time |
Timestamp (TIMESTAMP ) | Includes date, time, and the time zone |
Arrays
Array (ARRAY ) | A list of data with the same type. The order is important. An array can also be empty. For example, [-100, 0, 1, 5, 20] |
Struct
Struct (STRUCT ) | A set of customized data. We can represent each point on the map with a pair of numbers, their latitude, and longitude. Since any point will always have the longitude-latitude pair, we can create customized data that has two numbers (X, Y) for this purpose. |
Geography
Geography (GEOGRAPHY ) | Since spatial data is widely used, this data type can readily support location information. |
BigQuery Float vs Numeric Data Type
What is the difference between a FLOAT
and NUMERIC
data type?
Numeric types can handle a higher level of precision, at the cost of extra storage. Float types save storage and querying costs, at the cost of precision.
RELATED: What you need to know about Google BigQuery Pricing
Let’s see an example using the BigQuery platform. Here, we execute the following query, and get the following results.
SELECT 4.1 * 100 default_float, CAST(4.1 AS NUMERIC) * 100 use_numeric ;
This is how it should look like in the BigQuery Console:
Results:
Looking at the two columns, we see a certain loss of precision if we use FLOAT
. This does not happen all the time, and in most cases, the offset is negligible.
However, there are cases when accuracy is the top priority, e.g. dealing with financial information, or statistical analysis with very small numbers. In these cases, we need to use the NUMERIC
type.
In summary,
- Use
FLOAT
to save storage and query costs, with a manageable level of precision - Use
NUMERIC
for accuracy in the case of financial data, with higher storage and query costs
BigQuery String Max Length
With this, I tried an experiment.
I created sample text files and added them into a table in GBQ as a new table.
I experimented with strings of length 1, 2, 4, 8, and so on.
The maximum that I tried is 2^21, or 2,097,152 characters. That’s exactly 2 MB worth of data into a single cell in BigQuery! Insane!
But you know what’s more surprising? BigQuery still accepted these strings.
I don’t think a normal business user will need more than 2 million characters for a single string, so I stopped at that.
BigQuery accepts strings with a length of more than 2 million characters.
Warning: Querying with More Than 1024 Characters
As you work with very large strings above 1024 characters, you may start to get warnings after running a query, as shown below:
The warning can be similar to the following:
Some cell values are very long and the display is truncated to the first 1024 characters to improve browser performance. If full values are necessary, try lowering the number of rows per page before clicking "Show full values".
Since BigQuery is accessed through a browser, the platform limits the display of each data to at most 1024.
Solution: This does not affect the actual data in your database. To get the full information, export your data instead.
Querying with Large Strings
Additional Information: By the time I am querying strings with a length of 2 MB each, BigQuery creates a temporary table as part of the querying process. This can be a handling by the platform and did not affect my experience as an end user.
BigQuery Timestamp and Datetime
Do you need to store information about time, date, and time zone?
These are handled in BigQuery. We use the TIME
and DATE
data types to handle only the time and date components, respectively.
In most cases, we want to store both the date and time component. We can use the DATETIME
data type to store information such as events. In this data type, we assume that all dates follow the same local time zone.
However, for data that is handled through various time zones, it is best to use a TIMESTAMP
data type. This keeps information on the date, time, and time zone.
DATETIME Data Type | TIMESTAMP Data Type |
---|---|
05-31-2020 23:59:59.999999 | Using supported time zones 05-31-2020 23:59:59.999999 America/Los_Angeles Using time zone offsetting 05-31-2020 23:59:59.999999-8:00 |
Does not store information on time zone | Stores information on time zone. If not specified, defaults to UTC. |
BigQuery Cast and Convert Data Types
If we need to convert from one data type to another, we can use the CAST
function.
CAST(expression AS output_data_type)
For example, if we want to form a sentence, we need to convert all data (including numbers) into a STRING data type first. Mixing different data types is not recommended and can cause unexpected behavior.
Funnily, I attempted to convert my name into an integer, which is clearly wrong.
BigQuery replied by sending me 5 error messages telling me it’s not such a good idea.
Handling Errors in Conversion Using the SAFE_CAST Function
Sometimes, we process conversion for an entire column. In cases where some rows can cause an error, but we still want to proceed with the conversion, we use SAFE_CAST
.
SAFE_CAST(expression AS output_data_type)
If a cast is not possible, that specific data will return NULL
, while the others are converted successfully.
Conclusion
Data types in BigQuery work very similarly with other applications such as Microsoft Excel and other SQL Platforms.
It is always best to know our purpose for each data, so we can select the correct data type for that purpose.
In cases where we need to convert from one data type to another, we can use the CAST
and SAFE_CAST
functions.
You can also look at the Google BigQuery page for their official documentation on data types.
Pingback: 8 Google BigQuery Data Types: A Comprehensive Guide
Comments are closed.