My AI-tweaked Data Saturdays 23 Notes

Mutt0-ds | Nov 19, 2023 min read

Last Saturday I had the privilege of attending Data Saturdays #37 at Parma, an event series focused on Microsoft data platform technologies. Covering AI, Database, Business Intelligence, and Cloud, the event promised insights from top experts in Italy (and beyond).

Of course, I wrote down (almost) everything I heard. This blog post is actually a dump of my notes, which I tried to partially automated with a mix of OCR + ChatGPT, which struggled in understanding most of the complex contexts… My disorganized note management didn’t help, poor GPT.

So, my key points are slightly “tweaked” using AI and then carefully checked and simplified by me. I hope they serve as a useful resource for those interested in the topics discussed at the event. Kudos to everyone involved in making this event a success (organizers, speakers, sponsors), and I look forward to future editions that continue to push the boundaries of knowledge-sharing in the data and technology domain.

Special thanks to Ema for the support.

Update: most of the presentations have been uploaded on GitHub: you can find them here

☁ SQL Server PaaS by Gianluca Hotz

Three Solutions

  • SQL Server On-Premises: Classic solution.
  • IaaS (Infrastructure as a Service): Utilizes a purchased VM for MSSQL. Eliminates the need for VM maintenance.
  • PaaS (Platform as a Service): Azure offers three PaaS options for SQL Server. RDS (Relational Data Service) available for MySQL, PostgreSQL, MariaDB.

Three PaaS Database Running Options in Azure

  • DB Single: Requires manual management of the instance, including network and security.
  • Elastic Pool: A shared resource pool for multiple databases. Scales dynamically with an auto-scale feature.
  • Managed Instance: Entire SQL Server instance with PaaS advantages.

Purchase Models, Tiers, Options

  1. DTU (Database Transaction Unit): Estimates operations using a combination of CPU and storage. Simple and linear, though may be phased out in the future. Tiers include Free, Standard, Premium.
  2. vCores: Renting computing power. Tiers include General Purpose, Business Critical, Hyperscale (more on it later).

You can also choose between

  • Provisioned: Adheres to a traditional model with a fixed set of resources.
  • Serverless: Payment based on the minimum core and actual usage. Allows for automation, e.g., turning off during inactive periods.

Hyperscale

Utilizes a cluster-based model with modularized SQL, resembling AWS Aurora. Complexity in design but offers benefits like distribution and quorum-based operations. However, this is a new solution.

SQL Managed Instance Details

  • Entire SQL Server instance with PaaS advantages.
  • Azure manages hardware and all what’s under the hood.
  • Convenient, duplicated, and region-secured backups. This is already a game changer.
  • High availability through ready clones, with consideration for potential connection disruption during a switch.

The AWS offer

  • Different offerings from Azure, but fundamental features are similar.
  • Azure holds exclusivity on SQL Server; AWS updates may have a delayed rollout.

Fleet Manager - New Azure Feature

  • Thumb’s rule: one DB per tenant.
  • Fleet Manager assists in scaling numerous databases (in the order of thousands).
  • Databases can be categorized into tiers for balanced priority and power.
  • Kind of a pool of pools

📐 T-SQL performance tips & tricks by Sergio Govoni

1. Sargable Predicates

  • Learn to read Execution Plans; Index Seek good, Index Scan worse.
  • SARG (Search ARGument) - Always aim to use an index to optimize query performance.
  • Identify indexed columns and avoid operations on the predicates that hinder index usage. Example: If there’s an index on the date column, refrain from using functions like WHERE YEAR(date) = 2023 to prevent losing the index. This also works for ORDER BY not only where
  • Pay attention to handling nulls correctly (example to check: stored procedure ignoring NULLs in where, fixed with EXISTS + INTERSECT). NULL is a placeholder for something that may arrive, not a value!

2. Query Mode Execution

  • Again: carefully analyze the Execution Plan.
  • Prioritize BATCH over ROW mode (Estimated Execution Mode); reason by cluster of column value rather than each row.
  • Utilize a nonclustered columnstore index (empty) for faster queries. Check if the Execution Mode updated accordingly!
  • In Azure SQL and from SQL Server 2019, the engine employs batch mode where possible without creating the column index.Better because for transactional databases, it’s better to avoid indexes. Check compatibility levels and enable the feature accordingly.
  • There also POC Indexes for PARTITION BY queries

3. Join Order

  • The Query Optimizer (very complex) aims for the easiest path to make joins but might be incorrect.
  • Pay attention to join order; joins are evaluated based on the ON statement, if you want to join something inner first and then a left outer later, the ON of the left outer should be AFTER the first. Use parentheses for readability.

4. Tempdb

  • Tempdb is the public toilet of SQL Server, receiving a lot of activity. Serves as a shared space for temporary data.
  • Place Tempdb on a fast machine due to high traffic.
  • Create an index with a name to cache Tempdb and avoid creating new indexes for each query if they have anonymous indexes.

🧯 SQL Server On Fire by Gabriele Franco

Tools Used

  • Query Store - The black box of SQL Server, storing query stats and history. (note: read about Hint Query Store)
  • Extended Events (XE) - Monitors 3000 types of events for detailed insights.

Scenario: SQL Server at 100% CPU

  • Ran a stress test on the database to induce slowdown.
  • Analyzed Query Store to identify queries causing high resource consumption.
  • Utilized the estimated plan and adjusted the execution plan with an index to resolve the issue.
  • Demonstrated the comparison of queries within the Query Store tool.

Scenario: Memory Pressure Test

  • Executed a test under memory pressure with a poorly optimized table.
  • The query estimated 62 GB of memory, causing a bottleneck in the database.
  • Root cause: Inefficient use of the nvarchar(4000) type, leading to overestimated data size for millions of rows.
  • Remember the importance of defining the right data types!
  • Fun fact: SELECT TOP 100 is memory-optimized; TOP 101 is not.

Checking Active Sessions

  • Used sys.sp_whoisactive to identify resource-consuming sessions.
  • Addressed the issue of RESOURCE_SEMAPHORE, indicating exhausted resources.
  • Explored sys.query_store_plan for detailed information on query stats and resources.

Using Extended Events (XE)

  • Referenced the dm_xe_sessions table to observe and analyze extended events sessions.
  • Demonstrated downloading in XDL format for deadlocks (note: see more about that)

🔓 Unlocking the Power of AI-Driven Analytics: Mastering Generative AI in Power BI by Leon Gordon

The scenario

  • Deploying AI solutions to Production (PRD) is a challenging endeavor.
  • High failure rates are observed in many AI projects.
  • Case Study: PepsiCo’s AI Application for Cheetos Quality Check: AI used to identify irregularities in Cheetos for quality assurance

AI Facts

  • AI is a permanent and essential part of technology landscapes.
  • Effectively leveraging AI is crucial for project success.
  • Governance and security play a significant role in AI deployment.

Demo 1: ChatGPT Data Investigator with Marvel Movies Dataset

  • ChatGPT plugin showcased for uploading dirty data and cleaning up and transforming them into a schema suitable for Power BI. Basic visualization available.
  • Despite being a demo tool, it could be seen as a foundational step for future developments.

Demo 2: Data Flow with HTTP Call to OpenAI

  • Tested a data flow with an HTTP call to OpenAI for applying a column description using ChatGPT.
  • There are potential cost implications in a Production environment but it’s interesting to leverage HTTP calls using M PowerQuery language.

✖ Power BI “X Files” - 5 things they “forgot” to tell you! by Nikola Ilic

  • Very technical session for Power BI advanced users, a real showcase of hidden gems for optimizing queries and calculation

DAX Fusion

  • Horizontal Fusion: Combines calculations with different filters for improved performance. Enabled after September 2022.
  • Vertical Fusion:Classic fusion using filter context (measures).
  • No action needed; enabled in newer versions for query optimization. I should read the SQL explanation post.

Cache Me If You Can

  • Power BI caches results to enhance performance. E.g. double-clicking the same slicer item results in a single query.
  • Use Performance Analyzer to copy the query and check results in DAX Studio.
  • Avoid Beginner Mistake: For accurate testing, clean the cache before each new run.
  • Note: Cache may not be utilized for large queries.

The Dark Art of Table Processing

  • Leveraging columnar nature of VertiPaq (the database under the hood) for optimized refresh can be very important.
  • Refer to Phil Seamark’s blog post for insights into the refresh strategy.

Fold on Tight

  • Query Folding: the ability to move computation in SQL query from M
  • Beware: Appending, merging queries, or transforming data types may bypass folding, slowing down queries.
  • Check native view query in Power Query; if absent, it’s not folding. This is bad.

MDX Optimization

👀 What to look out for when migrating your data to the cloud; is it REALLY the right choice? by Emanuele Meazzo

I didn’t take notes here as I work with Ema and kindly shared with me the entire presentation, which was rich in interesting points about the question one should me when choosing to migrate to the Cloud. It’s not that simple. Amazing performance that made me leave the room with more questions than answer.

🕘 Save The Date by Rudi Bruchez

This session explores the nuances of different date and time types in SQL Server, highlighting the strengths and challenges of each and emphasizing best practices for handling date-related operations. Suggested article: The Ultimate Guide to Datetime datatypes

Four Main Types:

  • DATETIME
  • DATE
  • DATETIME2
  • Smalldatetime (less commonly used)

Datetime is a mess

  • Starts at 1753 because it’s essentially an int.
  • Rounds milliseconds in a peculiar way, leading to rounding discrepancies.
  • No time zone information (UTC can be used).
  • ISO 8601 attempts to standardize datetime but can be messy due to updates over the years.
  • Language-sensitive, with the format dependent on the language set in SQL Management Studio.

Formatting

  • CONVERT with numeric IDs to choose the desired format.
  • CAST automatically converts to the user language format (faster).
  • FORMAT (a .NET function) is external and slower.

Date: better, but not perfect

  • Starts from year 1 but still has rounding errors.
  • Various interesting manipulations with types and casting can be performed.

Removing Time

Challenge: non-existing dates in the past before the adoption of the modern calendar; triggers can be used to address this and prevent inserts on those days.

Datetime2

  • Ideal solution; should be used more.
  • Ranges from 1 to 9999, follows the Gregorian calendar.
  • Precision can be chosen, and there are no tricks with casting ints.