By 2015, Netflix had completed its move from an on-premises data warehouse and analytics stack to one based around AWS S3 object storage. But the environment soon began to hit some snags.
"Let me tell you a little bit about Hive tables and our love/hate relationship with them," said Ted Gooch, former database architect at the streaming service.
While there were some good things about Hive, there were also some performance-based issues and "some very surprising behaviors."
"Because it's not a heterogeneous format or a format that's well defined, different engines supported things in different ways," Gooch now a software engineer at Stripe and an Iceberg committer said in an online video posted by data lake company Dremio.
Out of these performance and usability challenges inherent in Apache Hive tables in large and demanding data lake environments, the Netflix data team developed a specification for Iceberg, a table format for slow-moving data or slow-evolving data, as Gooch put it. The project was developed at Netflix by Ryan Blue and Dan Weeks, now co-founders of Iceberg company Tabular, and was donated to the Apache Software Foundation as an open source project in November 2018.
Apache Iceberg is an open table format designed for large-scale analytical workloads that supports query engines including Spark, Trino, Flink, Presto, Hive and Impala. The move promises to help organizations bring their analytics engine of choice to their data without going through the expense and inconvenience of moving it to a new data store. It has also won support from data warehouse and data lake big hitters including Google, Snowflake and Cloudera.
Cloud-based blob storage like AWS S3 does not have a way of showing the relationships between files or between a file and a table. As well as making life tough for query engines, it makes changing schemas and time travel difficult. Iceberg sits in the middle of what is a big and growing market. Data lakes alone were estimated to be worth $11.7 billion in 2021, forecast to grow to $61.07 billion by 2029.
"If you're looking at Iceberg from a data lake background, its features are impressive: queries can time travel, transactions are safe so queries never lie, partitioning (data layout) is automatic and can be updated, schema evolution is reliable no more zombie data! and a lot more," Blue explained in a blog.
But it also has implications for data warehouses, he said. "Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format along with a very lightweight catalog. Iceberg enables direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses."
In October, BigLake, Google Cloud's data lake storage engine, began supporting Apache Iceberg, with Databricks' Delta format and the streaming-focused Hudi set to follow.
Speaking to The Register, Sudhir Hasbe, senior director of product management at Google Cloud, said: "If you're doing fine-grained access control, you need to have a real table format, [analytics engine] Spark is not enough for that. We had some discussion around whether we are going with Iceberg, Delta or Hudi, and our prioritization was based on customer feedback. Some of our largest customers were basically deciding in the same realm and they wanted to have something that was really open, driven by the community and so on. Snap [social media company] is one of our early customers, all their analytics is [on Google Cloud] and they wanted to push us towards Iceberg over other formats."
He said Iceberg was becoming the "primary format," although Google is committed to supporting Hudi and Delta in the future. He noted Cloudera and Snowflake were now supporting Iceberg while Google has a partnership with Salesforce over the Iceberg table format.
Cloudera started in 2008 as a data lake company based on Hadoop, which in its early days was run on distributed commodity systems on-premises, with a gradual shift to cloud hosting coming later.
Today, Cloudera sees itself as a multi-cloud data lake platform, and in July it announced its adoption of the Iceberg open table format.
Chris Royles, Cloudera's Field CTO, told The Register that since it was first developed, Iceberg had seen steady adoption as the contributions grew from a number of different organizations, but vendor interest has begun to ramp up over the last year.
"It has lots of capability, but it's very simple," he said. "It's a client library: you can integrate it with any number of client applications, and they can become capable of managing Iceberg table format. It enables us to think in terms of how different clients both within the Cloudera ecosystem, and outside it the likes of Google or Snowflake could interact with the same data. Your data is open. It's in a standard format. You can determine how to manage, secure and own it. You can also bring whichever tools you choose to bear on that data."
The result is a reduction in the cost of moving data, and improved throughput and performance, Royles said. "The sheer volume of data you can manage, the number of data objects you can manage, and the complexity of the partitioning: it's a multiplication factor. You're talking five or 10 times more capable by using Iceberg as a table format."
Snowflake kicked off as a data warehouse, wowing investors with its so-called cloud-native approach to separating storage and compute, allowing a more elastic method than on-prem-based data warehousing. Since its 2020 IPO, which briefly saw it hit a value of $120 billion, the company has diversified as a cloud-based data platform, supporting unstructured data, machine learning language Python, transactional data and, most recently, Apache Iceberg.
James Malone, Snowflake senior product manager, told El Reg that cloud blob storage such as that offered by AWS, Google and Azure is durable and inexpensive, but could present challenges when it comes to performance analytics.
"The canonical example is if you have 1,000 Apache Parquet files, if you have an engine that's operating on those files, you have to go tell it if they these 1000 tables with one parquet file a piece or if it is two tables with 500 parquet files it doesn't know," he said. "The problem is even more complex when you have multiple engines operating on the same set of data and then you want things like ACID-compliance and like safe data types. It becomes a huge, complicated mess. As cheap durable cloud storage has proliferated it has also put pressure downward pressure on the problem of figuring out how to do high-performance analytics on top of that. People like the durability and the cost-effectiveness of storage, but they also there's a set of expectations and a set of desires in terms of how engines can work and how you can derive value from that data."
Snowflake supports the idea that Iceberg is agnostic both in terms of the file format and analytics engine. For a cloud-based data platform with a steadily expanding user base, this represents a significant shift in how customers will interact with and, crucially, pay for Snowflake.
The first and smallest move is the idea of external tables. When files are imported into an external table, metadata about the files is saved and a schema is applied on read when a query is run. "That allows you to project a table on top of a set of data that's managed by some other system, so maybe I have a Hadoop cluster with a metastore; that system owns the security, it owns the updates, it owns the transactional safety," Malone said. "External tables are really good for situations like that, because it allows you to not only query the data in Snowflake, but you can also use our data sharing and governance tools."
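The external-table mechanism Malone describes is essentially schema-on-read: the files stay where they are, under another system's control, and a schema is imposed only at query time. A minimal Python sketch of that idea, with a made-up two-column schema and JSON standing in for the stored records:

```python
import json

# Records as they sit in external storage; nothing enforced on write.
raw_rows = ['{"id": "1", "name": "a"}', '{"id": "2", "name": "b"}']

# Hypothetical schema, applied only when a query reads the data.
schema = {"id": int, "name": str}


def read_with_schema(rows, schema):
    """Schema-on-read: parse and cast each record at query time."""
    return [
        {col: cast(json.loads(r)[col]) for col, cast in schema.items()}
        for r in rows
    ]
```

The trade-off is the one external tables carry in practice: the query engine gets typed rows without owning the data, but it also cannot guarantee the files are consistent at the moment it reads them.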
But the bigger move from Snowflake, currently only available in preview, is its plan to build a brand-new table type inside of Snowflake. It is set to have parity in terms of features and performance with a standard Snowflake table, but uses Parquet as the data format and Iceberg as the metadata format. Crucially, it allows customers to bring their own storage to Snowflake instead of Snowflake managing the storage for them, perhaps a significant cost in the analytics setup. "Traditionally with the standard Snowflake table, Snowflake provides the cloud storage. With an Iceberg table, it's the customer that provides the cloud storage and that's a huge shift," Malone said.
The move promises to give customers the option of taking advantage of volume discounts negotiated with blob storage providers across all their storage, or negotiating new deals based on demand, and paying Snowflake only for the technology it provides in terms of analytics, governance, security and so on.
"The reality is, customers have a lot of data storage and telling people to go move and load data into your system creates friction for them to actually go use your product and is not generally a value add for the customer," Malone said. "So we've built Iceberg tables in a way where our platform benefits work, without customers having to go through the process of loading data into Snowflake. It meets the customer where they are and still provides all of the benefits."
But Iceberg does not only affect the data warehouse market, it also has an impact on data lakes and the emerging lakehouse category, which claims to be a useful combination of the data warehouse and lake concepts. Founded in 2015, Dremio places itself in the lakehouse category also espoused by Databricks and tiny Californian startup Onehouse.
Dremio was the first tech vendor to really start evangelizing Iceberg, according to co-founder and chief product officer Tomer Shiran. Unlike Snowflake and other data warehouse vendors, Dremio has always advocated an open data architecture, using Iceberg to bring analytics to the data, rather than the other way around, he said. "The world is moving in our direction. All the big tech companies have been built on an open data architecture and now the leading banks are moving with them."
Shiran said the difference with Dremio's incorporation of Iceberg is that the company has used the table format to design a platform to support concurrent production workloads, in the same way as traditional data warehouses, while offering users the flexibility to access data where they have it, based on a business-level UI, rather than the approach of Databricks, for example, which is more designed with data scientists in mind.
While Databricks supports both its own Delta table standard and Iceberg, Shiran argues that Iceberg's breadth of support will help it win out in the long run.
"Neither is going away," Shiran said. "Our own query engine supports both table formats, but Iceberg is vendor agnostic and Apache marshals contributions from dozens companies including Netflix, Apple and Amazon. You can see how diverse it is but with Delta, although it is technically open source, Databricks is the sole contributor."
However, Databricks disputes this line. Speaking to The Register in November, CEO and co-founder Ali Ghodsi said there were multiple ways to justify Delta Lake as an open source project. "It's a Linux Foundation [project]. We contribute a lot to it, but its governance structure is in the Linux Foundation. And then there are Iceberg and Hudi, which are both Apache projects."
Ghodsi argued the three table formats, Iceberg, Hudi and Delta, were similar and all were likely to be adopted across the board by the majority of vendors. But the lakehouse concept distinguishes Databricks from the data warehouse vendors even as they make efforts to adopt these formats.
"The data warehousing engines all say they support Iceberg, Hudi and Delta, but they're not optimized really for this," he said. "They're not incentivized to do it well either because if they do that well, then their own revenue will be cannibalized: you don't need to pay any more for storing the data inside the data warehouse. A lot of this is, frankly speaking, marketing by a lot of vendors to check a box. We're excited that the lakehouse actually is taking off. And we believe that the future will be lakehouse-first. Vendors like Databricks, like Starburst, like Dremio will be the way people want to use this."
Nonetheless, database vendor Teradata has eschewed the lakehouse concept. Speaking to The Register in October, CTO Stephen Brobst argued that a data lake and data warehouse should be discrete concepts within a coherent data architecture. The argument plays to the vendor's historic strengths in query optimization and supporting thousands of concurrent users in analytics implementations which include some of the world's largest banks and retailers.
Hyoun Park, CEO and chief analyst at Amalgam Insights, said most vendors are likely to support all three table formats, Iceberg, Delta and Hudi, in some form or other, but Snowflake's move with Iceberg is the most significant because it represents a departure for the data warehouse firm in terms of its cost model, but also how it can be deployed.
"It's going to continue to be a three-platform race, at least for the next couple of years, because Hudi benchmarks as being slower than the other two platforms but provides more flexibility in how you can use the data, how you can read the data, how you can ingest the data. Delta Lake versus Iceberg tends to be more of a commercial decision because of the way that the vendors have supported this basically, Databricks on one side and everybody else on the other," he said.
But when it comes to Snowflake, the argument takes on a new dimension. Although Iceberg promises to extend the application of the data warehouse vendor's analytics engine beyond its environment, potentially reducing the cost inherent in moving data, that will come at a price: the very qualities that made Snowflake so appealing in the first place, Park said.
"You're now managing two technologies rather than simply managing your data warehouse which was which is the appeal of Snowflake," he said. "Snowflake is very easy to get started as a data warehouse. And that ease of use is the kind of that first hit, that drug-like experience, that gets Snowflake started within the enterprise. And then because Snowflakes pricing is so linked to data use, companies quickly find that as their data grows 50, 60, 70, or 100 percent per year. Their Snowflake bills increase just as quickly. Using Iceberg tables is going to be a way to cut some of those costs, but it comes at the price of losing the ease of use that Snowflake has provided."
Apache Iceberg surfaced in 2022 as a technology to watch to help solve problems in data integration, management and costs. Deniz Parmaksız, a machine learning engineer with customer experience platform Insider, recently claimed it cut their Amazon S3 costs by 90 percent.
While major players including Google, Snowflake, Databricks, Dremio and Cloudera have set out their stall on Iceberg, AWS and Azure have been more cautious. With Amazon Athena, the serverless analytics service, users can query Iceberg data. On Azure, however, ingestion from data storage systems that provide ACID functionality on top of regular Parquet format files, such as Iceberg, Hudi and Delta Lake, is not supported. Microsoft has been contacted for clarity on its approach. Nonetheless, in 2023, expect to see more news on the emerging data format, which promises to shake up the burgeoning market for cloud data analytics.