Data preparation
What's ETL?
Extract: From virtually any data source, from flat to complex files, databases, or cloud services
Transform: This allows us to modify or enrich the extracted information without modifying the source
Load: Once the transformation is done, it's responsible for loading the result to the data model
Transform data with Power Query
Power Query
It's a data connection technology that enables us to discover, connect, combine, and refine disparate data sources to meet our analytics needs
What does Power Query do?
Extract: From virtually any data source
Transform: Merge, combine, clean or enrich the data
Load: The data for further analysis in Power BI
Points to consider
The purpose of Power Query is to get data from a variety of sources and to get ready for further analysis
The purpose of this tool isn't to analyze the data
Magic: It's the collection of steps that are carried out to reach a result, it also allows us to go back or forward these, without modifying the data source. It's similar to the process that a macro performs in Excel
Data types
Unlike other programs like Excel, where data types are very flexible, in Power BI we need to define them well before we can use them
As in relational databases, each column must have its type well-defined to avoid errors
Classes:
Integer: Represents a 64-bit integer value (8 bytes). Since it is an integer, it has no digits to the right of the decimal place. Allows 19 digits; positive or negative integers between – 9,223,372,036,854,775,807 – ($2^{63}+1$) and 9,223,372,036,854,775,806 ($2^{63} – 2$ ). You can represent the largest possible precision of the various numeric data types. As with the Fixed Decimal Number type, the Integer type can be useful in cases where we need to control rounding
Decimal: Represents a 64-bit floating point number (8 bytes). It is the most common type of number. Although it is designed to handle numbers with fractional values, it also handles whole numbers. The Decimal Number type can handle negative values from to , and positive values from to . For example, numbers like 34, 34.01, and 34.000367063 are valid decimal numbers. The largest precision that can be represented in a decimal number type is 15 digits. The decimal separator can be placed anywhere in the number. The decimal number type corresponds to the way Excel stores its numbers. Note that a binary floating point number cannot represent all numbers within its supported range with 100% precision. Therefore, small differences in precision may occur when representing certain decimal numbers.
Fixed decimal: Also known as currency type, this data type has a fixed location for the decimal separator. The decimal separator always has four digits to the right and allows 19 significant digits. The largest value it can represent is 922,337,203,685,477.5807 (positive or negative). Unlike the decimal number, the fixed decimal number type is always precise and is therefore useful in cases where the imprecision of the floating point notation could introduce errors
Text: String of Unicode character data. They can be strings, numbers, or dates represented in a text format. The maximum string length is 268,435,456 Unicode characters (where each Unicode character is 2 bytes) or 536,870,912 bytes
Date/time: Represents a date and time value. The date and time values are stored as a type of decimal number, so you can convert from one to the other. The time part of a date is stored as a fraction in integer multiples of 1/300 seconds (3.33 ms). Dates between the years 1900 and 9999 are supported
Time: Represents only the time (without date). When converted to the model, the time value is the same as the date/time value with no digits to the left of the decimal position
Date/time/time zone: Represents a UTC date and time with a time zone offset. Converts to date/time when loaded into the model
Duration: Represents a period of time, which is converted to a type of decimal number when loaded into the model. As a type of decimal number, it can be added to or subtracted from a date/time field with the correct results. Since it is a type of decimal number, you can easily use it in visualizations that show magnitude
Boolean: Boolean value of True or False
Binary: The binary data type can be used to represent any other data in a binary format
Any: The Any data type is the state given to a column that does not have an explicit data type definition. Any is the data type that sorts all values. It is recommended to always explicitly define column data types for queries from unstructured sources and avoid having columns with data type Any as query output
Transformations
Transform: Shape the data
Common transformations:
Rename columns or tables
Change the data type (convert text to numbers, for example)
Remove rows
Set the first row as header
Add columns
Split columns
Replace values
Filter data
Combinations
Combining: Connecting to two or more data sources, shaping them as needed, and then consolidating them into a useful query
Common combinations:
Append queries
Combine queries
Combine binaries
Append
Allows us to join two or more tables
It is recommended that both have the same structure, if not, the system adds the non-matching fields with null values to the final set
It is similar to a standard SQL UNION operation
Results can be a new query or added to an existing step
Steps:
In Transform data (Power Query)
Menu Merge/Append queries
Combine
It allows us to take two tables and cross them through one or more fields in common
Usually used to complement the information in a table
It is the closest equivalent to the JOIN function of the SQL standard
Steps:
Tables are loaded normally
In Transform data (Power Query)
We choose the table (of the two), which we want to combine with the other
Menu Combine/Combine queries to create a new one
We choose the combination field(s)
We choose the type of combination
We choose the fields of the second table that we want to obtain
Combine binaries
This functionality allows us to extract the tables from the files through an automated process
It's usually used through the folder connector
It's especially useful when the information source is too fragmented for the append operation
Steps:
They provide us with an Excel file of sales on a weekly or monthly basis (for example)
The fields within the files must have the same structure and also have the same name
We will choose a field to combine the files in the operation
Go to: Get data / More / Folder and connect
Combine button
Combine and Load option
In the Combine files screen, we choose the combination parameter (one of the tables)
Extract and Append
Last updated