  • Use KQL effectively

    Understanding how to write efficient KQL queries is essential for getting good performance when working with Eventhouses. This unit covers key optimization techniques and explains why they matter for your queries.

    Why query optimization matters

    Query performance in KQL databases depends on the amount of data processed. When you understand how KQL processes data, you can write queries that:

    • Run faster by reducing the data scanned – For example, instead of scanning millions of rows, filter early to process only thousands
    • Use fewer resources – For example, selecting only 3 columns instead of all 50 columns reduces processing overhead
    • Work reliably with growing data – For example, a query that works on 1 million rows today will still perform well when your data grows to 10 million rows

    The key principle is: the less data your query needs to process, the faster it runs.

    Understand key optimization techniques

    Filter data early and effectively

    Filtering reduces the amount of data that subsequent operations need to process, and KQL databases use indexes and data organization techniques that make early filtering especially efficient.

    Time-based filtering is effective because Eventhouses typically contain time-series data:

    kql

    TaxiTrips
    | where pickup_datetime > ago(30m)  // Filter first - uses time index
    | project trip_id, vendor_id, pickup_datetime, fare_amount
    | summarize avg_fare = avg(fare_amount) by vendor_id
    

    Order your filters by how much data they eliminate – put filters that eliminate the most data first. Think of it like a funnel: start with the filter that removes the most rows, then apply more specific filters to the remaining data:

    kql

    TaxiTrips
    | where pickup_datetime > ago(1d)    // Time filter first - eliminates most data
    | where vendor_id == "VTS"           // Specific vendor - eliminates some data  
    | where fare_amount > 0              // Value filter - eliminates least data
    | summarize trip_count = count()
    

    Reduce columns early

    Projecting or selecting only the columns you need reduces resource usage. This is especially important when working with wide tables that have many columns.

    kql

    TaxiTrips
    | project trip_id, pickup_datetime, fare_amount  // Select columns early
    | where pickup_datetime > ago(1d)                // Then filter
    | summarize avg_fare = avg(fare_amount)
    

    Optimize aggregations and joins

    Aggregations and joins are resource-intensive operations because they need to process and combine large amounts of data. How you structure them can significantly affect query performance.

    For aggregations, limit results when exploring data:

    kql

    TaxiTrips
    | where pickup_datetime > ago(1d)
    | summarize trip_count = count() by trip_id, vendor_id
    | limit 1000  // Limit results for exploration
    

    For joins, put the smaller table first. When joining tables, KQL uses the first (left) table as the basis for matching rows in the second table, so starting with the smaller table means fewer rows to hold and match, making the join more efficient.

    kql

    // Good: Small vendor table first
    VendorInfo        
    | join kind=inner TaxiTrips on vendor_id
    
    // Avoid: Large taxi table first
    TaxiTrips         
    | join kind=inner VendorInfo on vendor_id


  • Get started with an Eventhouse

    When you create an Eventhouse, a default KQL database is automatically created with the same name. An Eventhouse contains one or more KQL databases, where you can create tables, stored procedures, materialized views, functions, data streams, and shortcuts to manage your data. You can use the default KQL database or create other KQL databases as needed.

    Screenshot of an Eventhouse in Microsoft Fabric.

    Work with data in your Eventhouse

    There are several ways to access and work with data in a KQL database within an Eventhouse:

    Data ingestion

    You can ingest data directly into your KQL database from various sources:

    • Local files, Azure storage, Amazon S3
    • Azure Event Hubs, Fabric Eventstream, Real-Time hub
    • OneLake, Data Factory copy, Dataflows
    • Connectors to sources such as Apache Kafka, Confluent Cloud Kafka, Apache Flink, MQTT (Message Queuing Telemetry Transport), Amazon Kinesis, Google Cloud Pub/Sub
    Screenshot of the Get Data menu for an eventhouse in Microsoft Fabric.

    Database shortcuts

    You can create database shortcuts to existing KQL databases in other eventhouses or Azure Data Explorer databases. These shortcuts let you query data from external KQL databases as if the data were stored locally in your eventhouse, without actually copying the data.
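
    For example, once a shortcut is in place, a cross-database KQL query can reference the shortcut database by name using the database() function. This is a minimal, hedged sketch; the database and table names are illustrative, not part of this module:

    kql

    // Query a table through a database shortcut as if it were local
    database('FleetTelemetry').TaxiTrips
    | take 10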

    OneLake availability

    You can enable OneLake availability for individual KQL databases or tables, making your data accessible throughout the Fabric ecosystem for cross-workload integration with Power BI, Warehouse, Lakehouse, and other Fabric services.

    Query data in a KQL database

    To query data in a KQL database, you can use KQL or T-SQL in KQL querysets. When you create a KQL database, an attached KQL queryset is automatically created for running and saving queries.

    Basic KQL syntax

    KQL uses a pipeline approach where data flows from one operation to the next using the pipe (|) character. Think of it like a funnel – you start with an entire data table, and each operator filters, rearranges, or summarizes the data before passing it to the next step. The order of operators matters because each step works on the results from the previous step.

     Important

    KQL is case-sensitive for everything including table names, column names, function names, operators, keywords, and string values. All identifiers must match exactly. For example, TaxiTrips is different from taxitrips or TAXITRIPS.

    Here’s an example that shows the funnel concept:

    kql

    TaxiTrips
    | where fare_amount > 20
    | project trip_id, pickup_datetime, fare_amount
    | take 10
    

    This query starts with all data in the TaxiTrips table, filters it to show only trips with fares over $20, selects specific columns using the project operator, and uses the take operator to return up to 10 of the rows that match the criteria in the where clause (take doesn't guarantee which rows are returned or in what order).

    The simplest KQL query consists of a table name:

    kql

    TaxiTrips
    

    This returns all columns from the TaxiTrips table, but the number of rows displayed is limited by your query tool’s default settings.

    To retrieve a sample of data from potentially large tables, use the take operator:

    kql

    TaxiTrips
    | take 100
    

    This returns 100 rows from the TaxiTrips table (not necessarily the first 100), which is useful for exploring the data structure without processing the entire table.

    You can also aggregate data:

    kql

    TaxiTrips
    | summarize trip_count = count() by taxi_id
    

    This returns a summary table showing the total number of trips (trip_count) for each unique taxi_id, effectively counting how many trips each taxi has made.

    Analyze data with KQL queryset

    KQL queryset provides a workspace for running and managing queries against KQL databases. The KQL queryset allows you to save queries for future use, organize multiple query tabs, and share queries with others for collaboration. The KQL queryset also supports T-SQL queries, allowing you to use T-SQL syntax alongside KQL for data analysis.
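
    As a hedged illustration of the T-SQL option (reusing the table and column names from the earlier examples, which are assumptions here), a query like the following could be run in the same queryset:

    SQL

    SELECT TOP 10 trip_id, pickup_datetime, fare_amount
    FROM TaxiTrips
    WHERE fare_amount > 20;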

    You can also create data visualizations while exploring your data, rendering query results as charts, tables, and other visual formats.
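
    For example, KQL's render operator turns a query result directly into a chart. A minimal sketch, reusing the TaxiTrips table from the earlier examples:

    kql

    TaxiTrips
    | where pickup_datetime > ago(1d)
    | summarize trip_count = count() by vendor_id
    | render columnchart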

    Screenshot of a visualization in a queryset.

    Use Copilot to assist with queries

    For AI-based assistance with KQL querying, you can use Copilot for Real-Time Intelligence.

    When your administrator enables Copilot, you see the option in the queryset menu bar. Copilot opens as a pane to the side of the main query interface. When you ask a question about your data, Copilot generates the KQL code to answer your question.

    Screenshot of Copilot for Real-Time Intelligence.


  • New opportunities with generative AI

    Throughout history, technological advancements have influenced the way we work. Today, AI is leading a new wave of change by introducing innovative ways to manage workplace tasks and redefining economic opportunities.

    The video explores how AI creates these new opportunities, highlighting its potential to enhance your professional goals and everyday tasks. From automating routine tasks to giving personalized career advice, AI is changing the landscape. Whether you’re aiming to improve your current role or explore new career directions, AI offers tools and insights to help you navigate these opportunities.

    https://go.microsoft.com/fwlink/?linkid=2287800

    Generative AI isn't just a tool for automation; it's also a powerful ally in enhancing creativity, problem-solving, and career development.


  • AI companion in content creation

    Generative AI offers various technologies to assist in idea creation. This video will cover a few examples, such as text-based generation, text-to-image generation, audio generation, and video generation. Each technology offers unique ways to bring your ideas to life.

    For example, text-based generation can help draft content or brainstorm new concepts. Text-to-image generation can transform your textual descriptions into vivid images. Audio generation can create music or sound effects based on your prompts. Video generation can produce visual content from text, simplifying the creation of engaging videos without requiring extensive editing skills.

    The video will show how you can use these technologies today, highlighting their potential to enhance creativity and streamline the idea creation process. Whether you’re exploring new ideas or bringing your visions to life, generative AI technology can significantly expand your creative toolkit.

    https://go.microsoft.com/fwlink/?linkid=2287701

    Generative AI technology offers a fresh set of tools for creativity, enabling you to push the limits of your aspirations and goals.


  • Visualize with AI from text to image

    With generative AI, visualizing ideas is more accessible than ever. One of the most exciting advancements is its ability to transform text descriptions into images. This technology, known as text-to-image generation, uses AI models to interpret text into visual representations. Imagine you’ve written a short story and want illustrations to bring it to life. By describing a scene through text, AI can create illustrations that match your description in no time, helping you find inspiration quickly or even critically analyze your story visually to see if you want to make any changes to enhance it further.

    This technology isn’t limited to authors. Designers, marketers, educators, and anyone with a creative vision can benefit from it. For instance, a teacher could describe a historical event and have AI generate an image to make the lesson more engaging for students. A marketer could visualize a campaign concept before it’s executed. The possibilities are endless.

    This video will introduce you to text-to-image technology and how you can access it today in Microsoft Copilot. By leveraging tools like DALL·E, integrated into Microsoft Designer, you can turn your textual descriptions into vivid images, bridging the gap between imagination and reality. Whether you’re looking to enhance a story, create visuals for a project, or simply explore your creativity, text-to-image AI is a powerful tool at your disposal.

    https://go.microsoft.com/fwlink/?linkid=2287802

    Generative AI is unlocking new possibilities for the creative process. With text-to-image generation, you can bring your ideas to life visually. In the next lesson, you’ll learn how AI can further assist you in content creation, making it an invaluable companion in your creative journey.


  • AI linguistics

    You might come across some interesting AI acronyms such as Large Language Models (LLMs) or Natural Language Generation (NLG). These acronyms are a part of a branch of AI called Natural Language Processing (NLP).

    These technologies enable computers to understand, generate, and respond to human language in new ways. From suggesting words while typing a message, to giving ideas for a creative project, AI provides further support for our creative endeavors.

    The video explains the meaning behind the acronyms and how this branch of AI is changing how humans interact with AI.

    https://go.microsoft.com/fwlink/?linkid=2287801

    Natural Language Processing (NLP) is a branch of AI that helps computers understand and respond to human language. Recently, Large Language Models (LLMs) have improved NLP applications.

    LLMs are advanced AI models trained on large amounts of text data. They predict words in sequences, enabling them to perform tasks like text generation, summarization, translation, and classification.

    LLMs are a key part of generative AI, which focuses on creating new content. By using LLMs, generative AI can produce human-like text, making it a useful tool in content creation.


  • What is generative AI?

    Generative AI is transforming the approach to productivity. Recent advancements in AI have greatly improved natural language processing and generation. These technologies now enable the creation of images, videos, texts, and audio from simple descriptions, transforming how you interact with technology.

    This video reveals how generative AI can provide support in the creative process and enhance productivity, highlighting its potential to reshape various industries by empowering people to achieve more with less. By automating repetitive tasks and providing creative suggestions, generative AI allows you to focus on what truly matters: envisioning innovative ideas, setting ambitious goals, and pursuing your dreams.

    https://go.microsoft.com/fwlink/?linkid=2287901

    Now that you have the foundations down, you might be curious about how it works. Generative AI functions through AI models, which are mathematical structures that learn from patterns in data using algorithms. There are several types of AI models with varying capabilities. Some AI models are designed to identify and classify information, while others, like generative AI models, excel in creating content.

    This video shows how the use of generative AI models varies based on one’s technical ability. For example, experts can customize these models for complex tasks, while beginners can use preexisting models or tools with minimal technical knowledge.

    https://go.microsoft.com/fwlink/?linkid=2287900

    Generative AI is changing the way you interact with technology, from creating content to helping with complex tasks. These advancements enable AI to understand and process human language, fostering a more natural interaction.


  • Use delta tables with streaming data

    All of the data we explored up to this point has been static data in files. However, many data analytics scenarios involve streaming data that must be processed in near real time. For example, you might need to capture readings emitted by internet-of-things (IoT) devices and store them in a table as they occur. Spark processes batch data and streaming data in the same way, enabling streaming data to be processed in real-time using the same API.

    Spark Structured Streaming

    A typical stream processing solution involves:

    • Constantly reading a stream of data from a source.
    • Optionally, processing the data to select specific fields, aggregate and group values, or otherwise manipulate the data.
    • Writing the results to a sink.

    Spark includes native support for streaming data through Spark Structured Streaming, an API based on an unbounded dataframe into which streaming data is captured for processing. A Spark Structured Streaming dataframe can read data from many different kinds of streaming sources, including:

    • Network ports
    • Real time message brokering services such as Azure Event Hubs or Kafka
    • File system locations (see the sketch after this list).
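
    As a hedged sketch of that last option, a folder can itself be treated as a streaming source, with Structured Streaming picking up new files as they land. The path and schema below are illustrative assumptions, not part of this module:

    Python

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Illustrative schema for incoming JSON order files
    order_schema = StructType([
        StructField("OrderID", IntegerType()),
        StructField("Customer", StringType()),
        StructField("Product", StringType()),
        StructField("Quantity", IntegerType())
    ])

    # Treat a folder as a streaming source; new files are read as they arrive
    file_stream_df = spark.readStream.format("json") \
        .schema(order_schema) \
        .load("Files/incoming_orders")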

     Tip

    For more information about Spark Structured Streaming, see Structured Streaming Programming Guide in the Spark documentation.

    Streaming with Delta tables

    You can use a Delta table as a source or a sink for Spark Structured Streaming. For example, you could capture a stream of real-time data from an IoT device and write the stream directly to a Delta table as a sink. You can then query the table to see the latest streamed data. Or you could read a Delta table as a streaming source, enabling near real-time reporting as new data is added to the table.

    Using a Delta table as a streaming source

    In the following example, Spark SQL in a notebook is used to create a Delta table to store details of internet sales orders:

    SQL

    %%sql
    CREATE TABLE orders_in
    (
            OrderID INT,
            OrderDate DATE,
            Customer STRING,
            Product STRING,
            Quantity INT,
            Price DECIMAL
    )
    USING DELTA;
    

    A hypothetical data stream of internet orders is inserted into the orders_in table:

    SQL

    %%sql
    INSERT INTO orders_in (OrderID, OrderDate, Customer, Product, Quantity, Price)
    VALUES
        (3001, '2024-09-01', 'Yang', 'Road Bike Red', 1, 1200),
        (3002, '2024-09-01', 'Carlson', 'Mountain Bike Silver', 1, 1500),
        (3003, '2024-09-02', 'Wilson', 'Road Bike Yellow', 2, 1350),
        (3004, '2024-09-02', 'Yang', 'Road Front Wheel', 1, 115),
        (3005, '2024-09-02', 'Rai', 'Mountain Bike Black', 1, NULL);
    
    

    To verify, you can read and display data from the input table:

    Python

    # Read and display the input table
    df = spark.read.format("delta").table("orders_in")
    
    display(df)
    

    The data is then loaded into a streaming DataFrame from the Delta table:

    Python

    # Load a streaming DataFrame from the Delta table
    stream_df = spark.readStream.format("delta") \
        .option("ignoreChanges", "true") \
        .table("orders_in")
    
    

     Note

    When using a Delta table as a streaming source, only append operations can be included in the stream. Data modifications can cause an error unless you specify the ignoreChanges or ignoreDeletes option.

    You can check that the stream is streaming by using the isStreaming property which should return True:

    Python

    # Verify that the stream is streaming
    stream_df.isStreaming
    

    Transform the data stream

    After reading the data from the Delta table into a streaming DataFrame, you can use the Spark Structured Streaming API to process it. For example, you could count the number of orders placed every minute and send the aggregated results to a downstream process for near-real-time visualization.
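
    As a hedged sketch of that per-minute count, and assuming the stream carried an event-time timestamp column (say OrderTimestamp, which is not in the sample orders_in table), a windowed aggregation could look like this:

    Python

    from pyspark.sql.functions import col, window

    # Count orders per one-minute window, tolerating events up to 5 minutes late.
    # OrderTimestamp is a hypothetical event-time column, not in the sample table.
    orders_per_minute = stream_df \
        .withWatermark("OrderTimestamp", "5 minutes") \
        .groupBy(window(col("OrderTimestamp"), "1 minute")) \
        .count()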

    In the next example, any rows with NULL in the Price column are filtered out, and new columns are added for IsBike and Total.

    Python

    from pyspark.sql.functions import col, expr
    
    transformed_df = stream_df.filter(col("Price").isNotNull()) \
        .withColumn('IsBike', expr("INSTR(Product, 'Bike') > 0").cast('int')) \
        .withColumn('Total', expr("Quantity * Price").cast('decimal'))
    

    Using a Delta table as a streaming sink

    The data stream is then written to a Delta table:

    Python

    # Write the stream to a delta table
    output_table_path = 'Tables/orders_processed'
    checkpointpath = 'Files/delta/checkpoint'
    deltastream = transformed_df.writeStream.format("delta").option("checkpointLocation", checkpointpath).start(output_table_path)
    
    print("Streaming to orders_processed...")
    

     Note

    The checkpointLocation option is used to write a checkpoint file that tracks the state of the stream processing. This file enables you to recover from failure at the point where stream processing left off.

    After the streaming process starts, you can query the Delta Lake table to see what is in the output table. There might be a short delay before you can query the table.

    SQL

    %%sql
    SELECT *
        FROM orders_processed
        ORDER BY OrderID;
    

    In the results of this query, order 3005 is excluded because it had NULL in the Price column. And the two columns that were added during the transformation are displayed – IsBike and Total.

    OrderID | OrderDate  | Customer | Product              | Quantity | Price | IsBike | Total
    3001    | 2024-09-01 | Yang     | Road Bike Red        | 1        | 1200  | 1      | 1200
    3002    | 2024-09-01 | Carlson  | Mountain Bike Silver | 1        | 1500  | 1      | 1500
    3003    | 2024-09-02 | Wilson   | Road Bike Yellow     | 2        | 1350  | 1      | 2700
    3004    | 2024-09-02 | Yang     | Road Front Wheel     | 1        | 115   | 0      | 115

    When finished, stop the streaming data to avoid unnecessary processing costs using the stop method:

    Python

    # Stop the streaming data to avoid excessive processing costs
    deltastream.stop()


  • Work with delta tables in Spark

    You can work with delta tables (or delta format files) to retrieve and modify data in multiple ways.

    Using Spark SQL

    The most common way to work with data in delta tables in Spark is to use Spark SQL. You can embed SQL statements in other languages (such as PySpark or Scala) by using the spark.sql library. For example, the following code inserts a row into the products table.

    Python

    spark.sql("INSERT INTO products VALUES (1, 'Widget', 'Accessories', 2.99)")
    

    Alternatively, you can use the %%sql magic in a notebook to run SQL statements.

    SQL

    %%sql
    
    UPDATE products
    SET ListPrice = 2.49 WHERE ProductId = 1;
    

    Use the Delta API

    When you want to work with delta files rather than catalog tables, it may be simpler to use the Delta Lake API. You can create an instance of a DeltaTable from a folder location containing files in delta format, and then use the API to modify the data in the table.

    Python

    from delta.tables import *
    from pyspark.sql.functions import *
    
    # Create a DeltaTable object
    delta_path = "Files/mytable"
    deltaTable = DeltaTable.forPath(spark, delta_path)
    
    # Update the table (reduce price of accessories by 10%)
    deltaTable.update(
        condition = "Category == 'Accessories'",
        set = { "Price": "Price * 0.9" })
    

    Use time travel to work with table versioning

    Modifications made to delta tables are logged in the transaction log for the table. You can use the logged transactions to view the history of changes made to the table and to retrieve older versions of the data (known as time travel).

    To see the history of a table, you can use the DESCRIBE HISTORY SQL command as shown here.

    SQL

    %%sql
    
    DESCRIBE HISTORY products
    

    The results of this statement show the transactions that have been applied to the table, as shown here (some columns have been omitted):

    version | timestamp            | operation    | operationParameters
    2       | 2023-04-04T21:46:43Z | UPDATE       | {"predicate":"(ProductId = 1)"}
    1       | 2023-04-04T21:42:48Z | WRITE        | {"mode":"Append","partitionBy":"[]"}
    0       | 2023-04-04T20:04:23Z | CREATE TABLE | {"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"}

    To see the history of an external table, you can specify the folder location instead of the table name.

    SQL

    %%sql
    
    DESCRIBE HISTORY 'Files/mytable'
    

    You can retrieve data from a specific version of the data by reading the delta file location into a dataframe, specifying the version required as a versionAsOf option:

    Python

    df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
    

    Alternatively, you can specify a timestamp by using the timestampAsOf option:

    Python

    df = spark.read.format("delta").option("timestampAsOf", '2022-01-01').load(delta_path)
    


  • Optimize delta tables

    Spark is a parallel-processing framework, with data stored on one or more worker nodes. In addition, Parquet files are immutable, with new files written for every update or delete. This can result in Spark storing data in a large number of small files, known as the small file problem, which means that queries over large amounts of data can run slowly, or even fail to complete.

    OptimizeWrite function

    OptimizeWrite is a feature of Delta Lake which reduces the number of files as they’re written. Instead of writing many small files, it writes fewer larger files. This helps to prevent the small files problem and ensure that performance isn’t degraded.

    Diagram showing how Optimize Write writes fewer large files.

    In Microsoft Fabric, OptimizeWrite is enabled by default. You can enable or disable it at the Spark session level:

    Python

    # Disable Optimize Write at the Spark session level
    spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", False)
    
    # Enable Optimize Write at the Spark session level
    spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", True)
    
    print(spark.conf.get("spark.microsoft.delta.optimizeWrite.enabled"))
    

     Note

    OptimizeWrite can also be set in Table Properties and for individual write commands.
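
    As a hedged sketch of the table-level option, Delta Lake exposes a table property for optimized writes. The property name below (delta.autoOptimize.optimizeWrite) is an assumption taken from the general Delta Lake documentation, so verify it in your environment before relying on it:

    SQL

    %%sql
    -- Enable optimized writes for one table (property name assumed from Delta Lake docs)
    ALTER TABLE products SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');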

    Optimize

    Optimize is a table maintenance feature that consolidates small Parquet files into fewer large files. You might run Optimize after loading large tables, resulting in:

    • fewer larger files
    • better compression
    • efficient data distribution across nodes
    Diagram showing how Optimize consolidates Parquet files.

    To run Optimize from Lakehouse Explorer (a notebook alternative is sketched after these steps):

    1. In Lakehouse Explorer, select the … menu beside a table name and select Maintenance.
    2. Select Run OPTIMIZE command.
    3. Optionally, select Apply V-order to maximize reading speeds in Fabric.
    4. Select Run now.
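
    The same maintenance can also be run from a notebook with Spark SQL. This is a minimal sketch; the VORDER clause is Fabric-specific syntax assumed from the V-Order option described below, and lakehouse2.products is the table name used later in this unit:

    SQL

    %%sql
    -- Consolidate small Parquet files; VORDER additionally applies V-Order sorting
    OPTIMIZE lakehouse2.products VORDER;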

    V-Order function

    When you run Optimize, you can optionally run V-Order, which is designed for the Parquet file format in Fabric. V-Order enables lightning-fast reads, with in-memory-like data access times. It also improves cost efficiency as it reduces network, disk, and CPU resources during reads.

    V-Order is enabled by default in Microsoft Fabric and is applied as data is being written. It incurs a small overhead of about 15%, making writes a little slower. However, V-Order enables faster reads from the Microsoft Fabric compute engines, such as Power BI, SQL, Spark, and others.

    In Microsoft Fabric, the Power BI and SQL engines use Microsoft Verti-Scan technology, which takes full advantage of V-Order optimization to speed up reads. Spark and other engines don't use Verti-Scan technology but still benefit from V-Order optimization, with reads typically about 10% faster and sometimes up to 50% faster.

    V-Order works by applying special sorting, row group distribution, dictionary encoding, and compression on Parquet files. It's 100% compliant with the open-source Parquet format, and all Parquet engines can read it.

    V-Order might not be beneficial for write-intensive scenarios such as staging data stores where data is only read once or twice. In these situations, disabling V-Order might reduce the overall processing time for data ingestion.

    Apply V-Order to individual tables by running the OPTIMIZE command from the Table Maintenance feature.

    Screen picture of table maintenance with V-order selected

    Vacuum

    The VACUUM command enables you to remove old data files.

    Every time an update or delete is done, a new Parquet file is created and an entry is made in the transaction log. Old Parquet files are retained to enable time travel, which means that Parquet files accumulate over time.

    The VACUUM command removes old Parquet data files, but not the transaction logs. When you run VACUUM, you can’t time travel back earlier than the retention period.

    Diagram showing how vacuum works.

    Data files that aren’t currently referenced in a transaction log and that are older than the specified retention period are permanently deleted by running VACUUM. Choose your retention period based on factors such as:

    • Data retention requirements
    • Data size and storage costs
    • Data change frequency
    • Regulatory requirements

    The default retention period is 7 days (168 hours), and the system prevents you from using a shorter retention period.

    You can run VACUUM on an ad-hoc basis or schedule it by using Fabric notebooks.

    Run VACUUM on individual tables by using the Table maintenance feature:

    1. In Lakehouse Explorer, select the … menu beside a table name and select Maintenance.
    2. Select Run VACUUM command using retention threshold and set the retention threshold.
    3. Select Run now.
    Screen picture showing the table maintenance options.

    You can also run VACUUM as a SQL command in a notebook:

    SQL

    %%sql
    VACUUM lakehouse2.products RETAIN 168 HOURS;
    

    VACUUM commits to the Delta transaction log, so you can view previous runs in DESCRIBE HISTORY.

    SQL

    %%sql
    DESCRIBE HISTORY lakehouse2.products;
    

    Partitioning Delta tables

    Delta Lake allows you to organize data into partitions. Partitioning can improve performance by enabling data skipping: irrelevant data objects are skipped based on their metadata, so queries read less data.

    Consider a situation where large amounts of sales data are being stored. You could partition sales data by year. The partitions are stored in subfolders named “year=2021”, “year=2022”, etc. If you only want to report on sales data for 2024, then the partitions for other years can be skipped, which improves read performance.
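
    As a hedged sketch (the path and partition column name are illustrative assumptions), a read that filters on the partition column lets Spark prune the folders for all other years:

    Python

    # Read only the year=2024 partition; Spark skips the other year folders
    sales_2024_df = spark.read.format("delta") \
        .load("Files/sales") \
        .where("year = 2024")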

    Partitioning of small amounts of data can degrade performance, however, because it increases the number of files and can exacerbate the “small files problem.”

    Use partitioning when:

    • You have very large amounts of data.
    • Tables can be split into a few large partitions.

    Don’t use partitioning when:

    • Data volumes are small.
    • A partitioning column has high cardinality, as this creates a large number of partitions.
    • Partitioning would result in multiple levels (for example, partitioning by more than one column creates nested folder levels).
    Diagram showing partitioning by one or more columns.

    Partitions are a fixed data layout and don’t adapt to different query patterns. When considering how to use partitioning, think about how your data is used, and its granularity.

    In this example, a DataFrame containing product data is partitioned by Category:

    Python

    df.write.format("delta").partitionBy("Category").saveAsTable("partitioned_products", path="abfs_path/partitioned_products")
    

    In the Lakehouse Explorer, you can see that the data is stored as a partitioned table.

    • There’s one folder for the table, called “partitioned_products.”
    • There are subfolders for each category, for example “Category=Bike Racks”, etc.
    Screen picture of the lakehouse explorer and the product file partitioned by category.

    We can create a similar partitioned table using SQL:

    SQL

    %%sql
    CREATE TABLE partitioned_products (
        ProductID INTEGER,
        ProductName STRING,
        Category STRING,
        ListPrice DOUBLE
    )
    PARTITIONED BY (Category);
    
