ZFS is the world’s most advanced filesystem, in active development for over a decade. Recent development has continued in the open, and OpenZFS is the new formal name for this open community of developers, users, and companies improving, using, and building on ZFS. Founded by members of the Linux, FreeBSD, Mac OS X, and illumos communities, including Matt Ahrens, one of the two original authors of ZFS, the OpenZFS community brings together over a hundred software developers from these platforms.
ZFS plays a major role in Solaris, of course, but beyond that, has it found other major homes? In fact, while we’re at it, how is Solaris doing anyway?
Solaris has become completely irrelevant.
I wouldn’t say that. I doubt anyone would consider deploying new Solaris systems unless they’re already heavily invested in that infrastructure, but there are still a lot of Solaris servers in use for now. It’s too bad Sun destroyed it, though; it had a lot going for it. Ah well, at least there’s FreeBSD, which is just as stable and compatible and has a much more consistent userland.
And so ZFS has found favour in the FreeBSD-based FreeNAS (and of course FreeBSD itself), for starters.
Personally, with a cheap 4-drive hardware RAID enclosure and the sheer size of today’s drives, I prefer the simplicity of RAID 10 for my own use.
RAID 10 is not sufficient for large SATA drives. Unrecoverable read errors are a fact of life, and the ability to reconstruct data from a second set of parity information is a must to get past a URE…
RAID 6 provides this; RAID 10 does not…
A triple mirror is another option…
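How likely is a URE during a rebuild, roughly? A quick back-of-the-envelope sketch, assuming a 4 TB drive and the commonly quoted consumer spec of one unrecoverable error per 10^14 bits read (runs in any POSIX shell):

# chance of hitting at least one URE while reading a full 4 TB drive,
# e.g. when rebuilding a degraded RAID 10 mirror
awk 'BEGIN {
    bits = 4e12 * 8                       # 4 TB drive, in bits
    p_clean = exp(bits * log(1 - 1e-14))  # P(every bit reads back fine)
    printf "P(clean rebuild) = %.2f, P(at least one URE) = %.2f\n", p_clean, 1 - p_clean
}'
# prints roughly 0.73 and 0.27 – about a 1-in-4 chance per full-drive read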
akro,
“Unrecoverable read errors are a fact of life, and the ability to reconstruct data from a second set of parity information is a must to get past a URE.”
IMHO RAID offers good protection against complete disk failures, but it’s ill suited to single occurrences of URE. A single URE doesn’t necessarily indicate a physical fault with the drive (it can be caused by power loss, EM radiation, a solar flare, etc.). RAID cannot easily differentiate between a faulty drive and a good drive containing bad data, so it rejects a drive that is all but 100% good because of a single unrecoverable read error. This puts the rest of the array at higher risk, since it’s possible (if not likely) that another drive has its own URE at a sector that was still readable on the decommissioned drive.
“RAID 6 provides this; RAID 10 does not… A triple mirror is another option…”
We could theoretically bump it up to arbitrary levels of disk redundancy at the expense of efficiency. However, I think it would make more sense to engineer a solution that explicitly addresses the difference between a URE and total drive loss.
My own solution would be to have network distributed redundancy rather than adding more redundancy through RAID. I’ve been working on something similar for a while for my own project.
ZFS + FreeBSD make a great combo, especially since FreeBSD supports it out of the box – no patches or extra packages needed.
It is well worth the extra effort required to use ZFS as root – it’s noticeably faster than UFS2. This is almost immediately apparent when you unpack the ports tree for the first time.
There are a couple of great guides on how to do that, too:
https://wiki.freebsd.org/RootOnZFS
http://www.aisecure.net/2012/01/16/rootzfs/
The first one is my preferred one, since it covers more usage scenarios, and stays updated with new releases.
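For a taste of what the guides walk through, the heart of it is something like this (a heavily abridged sketch – pool and partition names are illustrative, and the gpart partitioning and bootcode steps from the guides are omitted):

zpool create -o altroot=/mnt zroot /dev/gpt/disk0   # new pool, mounted under the installer's /mnt
zpool set bootfs=zroot zroot                        # tell the loader where the root filesystem lives
zfs set mountpoint=/ zroot
# ...then extract the base system into /mnt and set zfs_load="YES"
# in loader.conf before rebooting into the new root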
Apple should have put more effort into ZFS, since HFS+ is pretty lousy. Contrary to popular belief, you don’t need gobs of extra memory for ZFS to be beneficial. It is tunable for lower-memory systems (say, a 4GB MacBook Air); its defaults just assume a system with lots of memory, because it’s mostly used where heavy caching helps.
If you don’t have lots of extra memory to dedicate to just the filesystem, it’s still fast, reliable, and flexible.
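On FreeBSD, for instance, reining in the cache is a single loader tunable; the value here is just an illustration for a small machine:

# /boot/loader.conf – cap the ZFS ARC so the cache can't eat all the RAM
vfs.zfs.arc_max="512M"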
OS/4 OpenLinux, http://www.os4online.com, like FreeBSD, supports ZFS out of the box. It will be interesting to see if this gets more Linux developers and distributors behind ZFS.
Isn’t it a GPL violation to distribute non-GPL kernel modules with a GPL’d kernel? I thought it was generally accepted that a module using internal kernel APIs is a “derivative work,” which is why almost no distro ships ZFS out of the box.
Or does it use FUSE? If so, that’s unfortunate, since everything I’ve been able to find on ZFS-on-FUSE (which isn’t much) indicates it’s not production ready.
I remember seeing mention of a GPL-licensed clean-room implementation of ZFS, but I’m pretty sure that has gone nowhere, and it would be very patent-encumbered.
Either way, I wouldn’t want to rely on a feature that might get pulled from the next version due to legal issues, especially a file system driver.
ZFS has become very popular on FreeBSD where it is maintained practically in sync with Illumos.
OpenZFS is great news; I hope they find a way to merge the developments from all the ports.
Well, the greatest strength of ZFS is its built-in ability to make and manage snapshots.
Many SAN/NAS solutions use it due to its flexibility and strength – FreeNAS (already mentioned in another comment) or the more enterprise-oriented Nexenta.
At home I use FreeNAS for this specific ability, and at work we have enabled a stretched “metro cluster” based on Nexenta. It’s really sexy.
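For anyone who hasn’t played with it, the basic workflow is just a handful of commands (dataset name illustrative):

zfs snapshot tank/home@before-upgrade   # instant, near-zero cost
zfs list -t snapshot                    # see what snapshots exist
zfs rollback tank/home@before-upgrade   # undo everything since the snapshot
zfs destroy tank/home@before-upgrade    # drop it once it's no longer needed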
> how is Solaris doing anyway?
There are a lot of small-time operations working on Solaris-derived products, and said products are growing in size and quality. Solaris itself is quite dead, but the code lives on.
Joyent uses Solaris technology, although they modified it, added KVM virtualization, and called it SmartOS.
http://www.joyent.com/technology/smartos
It sounds interesting, but I doubt anything much will change since the license remains the same: CDDL.
This has been debated before, so there is no point in rehashing the arguments, but the truth of the matter is that the impact of ZFS will remain minimal until it can be safely integrated into all major distributions at installation time.
In other words, until major Linux distributions support it on a root install and will honor support contracts where ZFS is used, I doubt it will gain much traction, among other things, because right now ext4 is “good enough” for lots of people and very safe.
The code already has sufficient third-party contributions that Oracle cannot unilaterally change the license even if they wanted to.
OTOH, if the code could be made GPL, that would be great for Linux but bad for the filesystem: FreeBSD and probably even illumos would have to fork it for license reasons, and the code in the Linux tree would rapidly develop Linuxisms that would make a central vendor-independent distribution very difficult anyway.
That said, there is nothing legally stopping Linus from importing CDDL’d code; it’s just NIH syndrome.
Got a citation for that? According to Wikipedia, Danese Cooper said at DebConf 2006 that the CDDL was specifically designed to be GPL-incompatible because the Solaris devs didn’t want their best work to just get gobbled up by Linux when OpenSolaris was released.
IANAL but I don’t trust Wikipedia on legal stuff either. The subject of mixing GPL and CDDL comes up somewhat regularly on the illumos lists, and it may surprise you but there is already code that does just that: look at cdrecord.
It would also not be the first time that Linux brings in some less restrictively licensed code.
Except that Debian, Fedora, and others, have ditched cdrecord in favor of either cdrkit or libburn over concerns of mixing CDDL and GPL.
FUD. The author is not only *not* being sued, but he claims he received an OK from an FSF lawyer.
There is also the case of SmartOS, which is making money with KVM + illumos. Bryan Cantrill even gave a talk about it to the KVM devs.
Lawsuits are expensive and don’t magically appear as soon as someone violates terms, so nobody wants to risk accepting what may be poison for a future “Sun’s code in Linux” lawsuit just because Jörg Schilling hasn’t been sued yet.
In fact, as I remember, the main issue is that by mixing GPL and CDDL, he’s doing the legal equivalent of saying “I grant you permission to use this code as long as X == Y” where there’s a high probability that Y == X+1.
(It’s not his problem that he’s requiring people to simultaneously follow two mutually-exclusive sets of rules… it just means that any prospective users are effectively operating under “no permission to use granted” conditions.)
You have to plan for the least favourable interpretation of the legalese, not the most. (Unless you’re a huge company with tons of expensive lawyers. Then you can relax that rule.)
As for the OK from the FSF lawyer, do you have a first-hand citation for that? Wikipedia or not, I trust my citation more than your or Mr. Schilling’s word of mouth.
I have no interest in challenging Mr Schilling. Interested parties can always contact the SFLC, which, I have heard but don’t believe, is full of lawsuit-hungry lawyers.
I have no interest in suing Mr. Schilling… I just fully understand and agree with the principle of not incorporating code into your project if it engages in debatable licensing practices.
(I don’t care how safe it is if I might be the one having to engage in a costly legal battle to set the precedent proving that safety.)
Regarding SmartOS’s use of KVM with the Illumos kernel: since the KVM modules would be linked against the kernel – and more specifically, against a public and well-documented external API that is provided specifically for third parties to build and distribute kernel modules and drivers (since Illumos has presumably kept the stable driver interfaces of Solaris) – wouldn’t it be a non-violation to port GPL kernel modules to Illumos? It’d be hard to say Illumos is a derived work of KVM if KVM links against public interfaces.
I am pretty sure the port is/was very invasive. It is not really a matter of KVM being a derivative work of Linux or not; KVM itself is GPL’d, so linking it to a non-free kernel would be a license violation.
OTOH, ZFS is in no way a derivative work of the Linux kernel, as it originated in another OS and, as you say, the porters are just linking against public interfaces.
The KVM-for-Illumos module doesn’t use the deep hooks that the Linux version has (which are marked as GPL-only).
The port didn’t modify the Illumos kernel at all – all the changes live in the KVM module, which is also maintained separately from the Illumos kernel.
From an excellent article on it:
http://lwn.net/Articles/459754/
You clearly don’t know what you’re talking about. Linux has a mix of public and private interfaces – private interfaces are marked as GPL-only. Linus specifically says that linking against the public system call interfaces doesn’t make your user-space software a derived work of the kernel – it’s not an exception, just a statement of fact.
Public and stable kernel module interfaces (Which Illumos is full of) would clearly fall under the “System Library” exception of the GPL. Either way, it’s generally okay to write GPL code and link it against a non-free library. (see http://www.gnu.org/licenses/gpl-faq.html#FSWithNFLibs)
The FSF even grudgingly admits that.
TBH, I honestly don’t care what Linux considers private or public; the license doesn’t mention any difference between them, so such considerations usually have little legal weight.
The more interesting question would be how this applies to ZFS. The ZFS on Linux site carries an informative link in their FAQ:
http://zfsonlinux.org/faq.html#WhatAboutTheLicensingIssue
The combination of them causes problems because it prevents using pieces of code exclusively available under one license with pieces of code exclusively available under the other in the same binary. In the case of the kernel, this prevents us from distributing ZFS as part of the kernel binary. However, there is nothing in either license that prevents distributing it in the form of a binary module or in the form of source code.
The difference carries a ton of weight.
Without it, it would be impossible to run GPL/CDDL software on a kernel under another license or to use a GPL/CDDL kernel to run proprietary software.
When you link against a private interface, the normal rules apply.
When you link against a public interface, you’re using the legal rules for making a system call from userland.
That’s what syscalls are. They’re function calls to public interfaces in the kernel, wrapped in some extra libc and kernel machinery to handle things like the transition between user mode and kernel mode.
Without that difference, calling fopen() could be enough to require that the licenses of the kernel and of the software run on it be mutually compatible.
Nah, very, very little.
http://www.networkworld.com/news/2006/120806-closed-modules2.html
The whole article is worth a read, and it covers the issue in much greater depth than I can here.
At some point, I definitely will. It looks interesting.
From what I read there, my main comment is that no single court has power to set worldwide precedent, so, as I said before, “err on the side of caution” tends to be the rule of the day.
Too bad for you that Wikipedia provides a link to an Ogg Theora recording, where you can actually *listen to* Cooper stating what was mentioned above.
And on the illumos homepage there is a talk by Bryan Cantrill, who co-authored the CDDL, saying that what Danese Cooper said is not true.
I work for an Oracle (Sun) partner company. It’s the only such partner in my country, but it’s been that way for 20 years now. So our influence is not wide – but where we exist, it’s critical and crucial.
The development of Solaris goes on well. Important software for it is found everywhere. For example, big vendors from other fields (say, large-scale networking devices) are producing software to control their equipment, and that software still comes in Solaris versions.
From the field, I can tell you the sales of Sun hardware and software were cut down tremendously once Oracle took over. We had a very tough 24 months, but it’s not like that anymore. Our customers continue buying Oracle servers (not as many as they bought Sun servers) and thus deploying new versions of Solaris. Heck, my team is being pushed to have every man certified for Solaris 11.1 as soon as possible.
So, as a person who earns his salary through Solaris expertise, I cannot accept the claim that Solaris has dropped dead. However, I’m not the one to say Solaris will never die – the fully libre GNU/Linux will overtake everything (while the not-so-libre Java/Linux will not).
I’d say, in 10 years GNU/Linux will be the only significant operating system in the world and software engineers will be able to move from Operating Systems onto other fields in need of free spirit (proprietary VoIP protocols?). Until then, Solaris will probably do well.
P.S. I earn my money from GNU/Linux too 😉
I think people forget that just because an OS or technology isn’t in the news doesn’t mean that suddenly nobody is using it.
I still see VB6 apps being maintained (and the code bases are that bad tbh).
It’s not just that FreeBSD has become a good way to get ZFS, but also that ZFS has become a core FreeBSD feature.
Practically speaking, that means that tools like their new package building system (poudriere) can use (and used to require) ZFS and snapshots to quickly create and destroy the clean environments the packages are built in. It also means that they have de-prioritized their previous RAID solutions, since ZFS is a better way to do almost anything where you pool a lot of disks … at least unless you’re memory-limited.
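The mechanism behind that is plain ZFS snapshot/clone; something like the following (dataset names illustrative) is what poudriere automates for every build:

zfs snapshot tank/poudriere/jails/ref@clean   # snapshot the pristine build jail once
zfs clone tank/poudriere/jails/ref@clean tank/poudriere/work/build-01
# ...build packages inside the disposable clone...
zfs destroy tank/poudriere/work/build-01      # throwing it away is nearly free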
This isn’t the first time I’ve heard this or something similar. AFAIK, the only thing keeping ZFS from being incorporated into the kernel is the license. So, until they find a way to relicense it, or reverse engineer the whole thing in such a way that they can release it under the GPL… I just can’t see it happening.
Strange, I could swear I was reading a site called OSNews, not “Thom’s blog”. I suppose researching stories before posting them is so old school…
fat12 ftw
ZFS is quite popular in the FreeBSD/FreeNAS/PC-BSD communities. In fact, several PC-BSD features are tied to ZFS, such as operating system snapshots and jail snapshots. For workstations and servers, ZFS gives PC-BSD/TrueOS a big boost over many competing operating systems. I’d also like to point out that ZFS is stable on Linux. In fact, I’ve been running ZFS on Linux servers for over a year and it’s been great. I love having the instant snapshots, and the ZFS code is more stable than Btrfs at the moment.
Sun had 30,000 customers. Oracle has 340,000 customers. Oracle is betting heavily on Solaris for high-end, large >32-socket servers. Larry Ellison said that “Linux is for low end, Solaris for high end”. Oracle owns the whole stack now: hardware (SPARC), OS (Solaris), middleware (Java), applications (database). So Oracle is adding things into the stack that boost the Oracle database; for instance, Solaris is 17% faster than other OSes at running Oracle Database. The database is the heart of every company – without databases you cannot run a company – so the most important piece at a company is the database. And Oracle owns the database market (80% or so).
So, Oracle is now boosting Oracle Database throughout the whole stack. The results show in the latest TPC-C and TPC-H database benchmarks: Oracle is much, much faster than anyone else, including IBM POWER7, Intel Xeon, HP, etc. The fastest database servers in the world are from Oracle; here are several database benchmarks:
https://blogs.oracle.com/BestPerf/entry/21030612_sparc_t5_4_tpc
So, companies are not really interested in a certain technology; they are interested in solutions. If Oracle has a database solution (running Solaris on SPARC) that is much, much faster than anyone else’s, companies will buy it. Management doesn’t care about tech talk; they want to see fast results. Oracle has a 10x performance/price advantage over IBM POWER7, for instance. Just check the official benchmarks.
Also, Oracle is targeting large servers, serving many clients with extreme throughput. Which means Oracle is also fastest at Java server workloads, SAP, etc. All types of server workloads. Just read the link above.
Why did Larry Ellison say that “Linux is for low end”? Why not use Linux for the high end as well? Well, the largest Linux servers today are 8-socket servers. Larger Linux servers have never existed. You cannot find a 16-socket Linux server, for instance; no one has ever sold one. How can Linux kernel developers improve scalability on large Linux servers when the hardware does not exist? How can they test their code? The largest Linux servers they can test their code on are common 8-socket x86 servers from Dell, HP, IBM, etc.
Sure, there are SGI UV1000 servers with 10,000s of cores and loads of RAM, but that is a cluster consisting of several PCs on a fast switch. The ScaleMP Linux server with 1000s of cores is also a cluster. So all Linux servers with 1000s of cores are clusters. They are all NUMA servers, and NUMA servers are clusters:
http://en.wikipedia.org/wiki/Non-uniform_memory_access#NUMA_vs._clu…
“…One can view NUMA as a tightly coupled form of cluster computing….”
NUMA servers are primarily used for HPC number crunching and cannot run databases in large configurations; the database performance would be extremely bad. So no enterprise companies use NUMA servers, because they are primarily interested in databases. Only rendering farms use NUMA servers from SGI or ScaleMP to do number crunching.
Sure, you can run a distributed database over all the nodes in an HPC cluster – but you cannot run a monolithic database such as Oracle on an HPC cluster. So if you ever see a database benchmark on large NUMA clusters, it must be a distributed database, Uridium and Virtual Void. You would never see an Oracle database running on an SGI UV1000 cluster, for instance.
If you look at a Solaris or IBM AIX server with 32 sockets, it is built as an SMP server. An SMP server is a “single, huge server”, not an HPC cluster. The difference between an HPC cluster and an SMP server is the RAM latency. An HPC cluster has >10,000 ns worst-case RAM latency, because data cells might be far away in another node – so you need to make sure your data stays close, in adjacent nodes, etc.
SMP servers have a worst-case latency of a few hundred ns, so you program clusters totally differently from a true SMP server. An SMP server is just an ordinary server; you don’t need to redesign your software, just program as normal. Just copy your normal binaries to the SMP server and run them. If you copy normal binaries to an HPC cluster, they will not work well, because you need to redesign the software so data is close in adjacent nodes, etc. – otherwise performance will be very bad.
You cannot run normal software on an HPC cluster; it can only run a certain type of software with embarrassingly parallel workloads. The big difference is RAM latency: clusters have very bad worst-case latency, an SMP server has good worst-case latency.
The new Oracle M6 server next year has 96 sockets, with 9,216 threads and 96 TB RAM. It is not a true SMP server – it shares some NUMA features – but its worst-case latency is very, very good, so you treat it simply like a true SMP server, with no need to treat it as an HPC cluster. It is designed to run the Oracle database in huge configurations, all from RAM. So the SAP HANA in-memory database will not be a threat, Larry said.
(All 32 socket Unix servers share some NUMA features, but they have very good RAM latency, so you treat them all as a true SMP server).
So, for the high end we will continue to find IBM and Oracle and HP with 32-socket servers. For the low end, with 1-8 socket servers, there will be Linux and Windows. Until Linux handles servers larger than 8 sockets, Linux will never venture into the high-end enterprise. So Linux scalability continues to be a myth.
You seem to have some insight into Oracle’s business plans which is not just a mere forecast. That’s fine and well. What is not fine and well is stating “GNU/Linux is for low-end only” and “GNU/Linux scalability is a myth”.
Oracle presents its Engineered Systems as the best machines it has. When I remember the way Oracle is marketing these machines (OpenWorld keynotes, “Oracle Magazine” and, oh dear, trailers for Iron Man movies), it’s clear they consider them a big deal. The descriptions certainly seem impressive: http://www.oracle.com/us/products/engineered-systems/index.html
Should we also describe what operating systems they run on?
– Exadata, Database appliance, Exalytics, Big data appliance, Network applications platform run on Oracle Linux
– Exalogic can either run on Oracle Linux or Solaris
– Supercluster and the ZFS storage appliance run on Solaris
– The Virtual compute appliance runs Oracle VM, but I don’t know much about it
It appears that if it weren’t for the facts that SPARC can be properly used only with Solaris and that the ZFS storage OS was already nicely built, then GNU/Linux would be the only choice.
Now, the question for you. You are certainly aware of the expandability of the Engineered Systems. Can you explain how come GNU/Linux is not scalable? Were you able to deduce that from Oracle’s behaviour?
Once again, I am all for the success of Solaris, but I’m even more against false claims over GNU/Linux.
All the best.
Larry Ellison said that Linux is for the low end in an official interview. Read here for Oracle’s official stand on Linux vs Solaris:
http://searchenterpriselinux.techtarget.com/news/1389769/Linux-is-n…
Oracle’s engineered systems are really good, and they surely run Linux and/or Solaris. However, these systems are small. For instance, the ExaData has only a few CPUs. The ExaLogic is a cluster:
http://en.wikipedia.org/wiki/Oracle_Exalogic
And as we all know, Linux scales excellently in clusters. No one has ever denied that.
The question is whether Linux scales on a single fat server. And Linux does not scale on such an SMP server, because no Linux server with 16 or 32 CPUs has ever been for sale. There are Linux clusters for sale with 1000s of cores, but no 16- or 32-CPU Linux server. So, if no 32-socket Linux servers exist, how can Linux scale well on SMP servers? They don’t even exist; how can anyone even benchmark and assess Linux scalability?
So, the Oracle engineered systems running Linux are either tiny (up to 8 sockets) or they are clusters. The Oracle engineered systems also exist in a modified Solaris version, called the “Supercluster”. The engineered systems were developed before Oracle bought Sun, and that is the main reason they mostly run Linux. “Supercluster, the fastest engineered system at Oracle”:
http://www.oracle.com/us/products/servers-storage/servers/sparc/sup…
The new Oracle M6 server will have more than 32 sockets – it will have 96 sockets. And it will run Solaris, not Linux, because Linux cannot scale beyond 8 sockets in a single fat (SMP) server.
These are not false claims. There has never been a single fat Linux server with more than 8 CPUs for sale. The largest Linux server ever sold has 8 CPUs and is just a normal HP, Dell, or IBM x86 server. I suggest you google for larger Linux servers; you will not find any. Thus, I speak the truth, and the proof is simple: no one has ever sold Linux servers larger than 8 CPUs. If you find a larger Linux server, then I am wrong. You will not even find benchmarks on a large Linux server. It is like IBM: they never release benchmarks on their large IBM Mainframes, because Mainframes have really slow CPUs, much slower than x86 CPUs. The IBM POWER7 is a good CPU, though.
Linux scales excellently in clusters, up to 10,000s of cores and maybe even beyond that.
Linux scales very badly in SMP servers, topping out at 8 sockets. HP tried to compile Linux for their big 64-socket Unix server, and Linux failed miserably: it had ~40% CPU utilization, which means most CPUs were idle on a 64-CPU server. That is quite bad. Google “HP Big Tux Linux server” for more information.
Kebabbert,
“Larry Ellison said that Linux is for the low end in an official interview. Read here for Oracle’s official stand on Linux vs Solaris:”
I was unable to find the interview in your link (membership required?). Of course he would say that, though; he’s biased. Bill Gates would have said the same thing regarding Windows. They are not really going to admit that their products can be replaced by much cheaper Linux alternatives, even if it were true. I hope you understand why asking Larry Ellison is meaningless. Consider his claim here:
http://www.x-drivers.com/news/hardware/4900.html
“Oracle claimed a legion of record-breaking benchmark performances for the T4-4. Ellison repeatedly compared the performance of the T4-based Sparc SuperCluster to IBM’s Power line—and the Power 795 in particular. A one-rack T4 SuperCluster ‘is twice as fast as IBM’s fastest computer, at half the cost,’ he claimed.”
When in fact… “But the benchmarks that Oracle cited were mostly internal ones. Those may carry some weight for many Oracle customers, but there were only two that really hint at the T4-4’s performance beyond software that has been tuned for that processor. One of those third-party benchmarks was the TPC-H benchmark for a 1,000 GB load, in which the T4-4 beat the IBM Power 780 and Itanium-based HP Superdome 2 on price/performance, raw performance, and throughput.”
and
“The T4 is still outperformed on Oracle Database 11g by HP’s BladeSystem RAC configuration running Oracle Linux, and edged out by HP’s Proliant DL980 G7 running Microsoft SQL Server 2008 and Windows Server 2008 both on price performance and raw power. Both are x86 systems.”
I’m not sure whether the additional cores with Sparc are actually beneficial. I just wasn’t able to find much information about it.
“The question is whether Linux scales on a single fat server. And Linux does not scale on such an SMP server, because no Linux server with 16 or 32 CPUs has ever been for sale. There are Linux clusters for sale with 1000s of cores, but no 16- or 32-CPU Linux server. So, if no 32-socket Linux servers exist, how can Linux scale well on SMP servers? They don’t even exist; how can anyone even benchmark and assess Linux scalability?”
Do you know if it’s *really* a Linux problem instead of an x86 SMP scalability problem? I honestly don’t think x86 can scale efficiently beyond 8 cores under any OS. My understanding is that Linux will run on the same SPARC architectures that Solaris does:
http://www.phoronix.com/scan.php?page=news_item&px=MTE5Nzc
Do you have a benchmark of an apples to apples comparison between solaris and linux on the same processors (ignoring that such processors are not being sold with linux)?
Mind you, Solaris *could* be better than Linux for high-end deployments. I’m genuinely curious about it, and if you have any evidence (benchmarks & case studies), that would be very informative to me.
For that matter, I’m very curious about the scalability of 64 core shared memory systems in general regardless of OS. Correct me if I’m wrong, but it seems to me that it would scale badly unless it were NUMA (or it had so much cache that it could effectively be used as NUMA).
It’s fun to talk to others who are passionate about this stuff!
What I meant to quote before was this, but I was too hasty.
“some of the claims Oracle made about its performance on third-party benchmarks were based on selected interpretations of the data – which drew catcalls from IBM Systems and Technology chief technical strategist Elisabeth Stahl. ‘Oracle claimed nine T4 world records. 7 of the 9 are not industry standard benchmarks but Oracle’s own benchmarks, most based on internal testing,’ Stahl blogged. ‘Oracle’s SPECjEnterprise2010 Java T4 benchmark result, which was highlighted, needed four times the number of app nodes, twice the number of cores, almost four times the amount of memory and significantly more storage than the IBM POWER7 result.’”
It does suggest that Larry Ellison’s claims should be taken with a grain of salt. Although, in fairness to him, everyone else probably does the same thing.
The SPARC T4 has a maximum of 4 sockets; there are no larger T4 servers, and as such it is a tiny server. The HP BladeSystem is a RAC cluster consisting of several PCs, and it is easier to beat a single server if you are using clusters. The Windows SQL Server 2008 machine is a small server. Sure, it competes with the SPARC T4 on 4-socket servers, but the difference is that SQL Server is topped out, whereas Oracle Database can continue to scale even to very large servers. Just because you can compete on 1-4 sockets does not mean you can keep competing above 8 sockets. I am convinced SQL Server has scaling problems, and you will see steep deterioration above 4 sockets or so. But on 1-4 sockets I think SQL Server does excellently, because MS has optimized and tested the code for a few sockets. It is difficult to scale to many sockets and keep performance intact; there is a scaling problem.
But I agree that the SPARC T4 was not that good on all benchmarks, because it stopped at four sockets. The T4 is more of a server CPU, built to handle many threads and many users at high throughput. A desktop CPU has a few strong threads, so it cannot service many clients, only a few of them at a time. On databases, though, the T4 was very good. Oracle is a database company, so that is to be expected.
Regarding the “Oracle internal benchmarks” that the IBM spokesperson mentioned: Sun Microsystems’ SPARC CPUs held several world records running Oracle software long before Oracle bought Sun. Oracle is an important enterprise software provider, so IBM and Sun and many others benchmarked Oracle’s software, “Peoplesoft”, etc. And SPARC was always much faster than IBM POWER, because SPARC is a server CPU handling many clients. For instance, you needed 14 (fourteen) IBM POWER6 CPUs to match 4 SPARC T2 CPUs in Peoplesoft v8.0 benchmarks back then (google “Peoplesoft v8 benchmarks” and compare IBM to Sun). SPARC T2 crushed IBM back then in many server benchmarks, which all favour high throughput. Some years later, Oracle bought Sun, and guess what? SPARC still continued to crush IBM on the same Oracle benchmarks, Peoplesoft, etc. But because IBM still could not beat SPARC in high-throughput benchmarks using Oracle software, IBM changed strategy and said “no, we will not compete on Oracle benchmarks any more”. But SPARC had always crushed IBM POWER on those benchmarks, even long before Oracle owned SPARC. Ergo, “internal benchmarks” is just FUD from IBM.
You should know that IBM FUDs a lot; according to Wikipedia, it was actually IBM that started to employ FUD on a wide, systematic scale:
http://en.wikipedia.org/wiki/Fear,_uncertainty_and_doubt#Definition
The latest SPARC T5 CPU is twice as fast as the T4, and T5 systems have twice the number of sockets, up to 8, so T5 servers are four times faster. They are very good at databases and also hold the world record in many server benchmarks. The T5 is almost three times as fast as IBM POWER7 on some database benchmarks. It utterly crushes POWER7; IBM cannot compete anymore.
And guess what IBM is saying in response to the world’s fastest CPU, the T5? “We don’t understand why Oracle is talking about how fast their CPUs are; no one is interested in CPU performance any longer. Talking about CPU performance is so 2000ish.” I am convinced that if the coming IBM POWER8 is good, IBM will start to brag a lot about how fast the POWER8 is.
http://blogs.wsj.com/digits/2013/03/27/ibm-fires-back-at-oracle-aft…
“…[performance] was a frozen-in-time discussion,” Parris said in an interview Wednesday. “It was like 2002 – not at all in tune with the market today.” Companies today, Parris argued, have different priorities than the raw speed of chips. They are much more concerned about issues like “availability” – resistance to break-downs – and security and cost-effective utilization of servers than the kinds of performance numbers Ellison throws out…
My point is that when IBM cannot compete, IBM will start to FUD. As IBM always has done.
For instance, the largest IBM Mainframe, with 24 sockets, is claimed to replace up to 1,500 x86 servers. But if you happen to know that IBM Mainframe CPUs at 5.26GHz are much slower than a decent x86 CPU, you will start to wonder. If you dig a bit, it turns out that all the x86 servers are idling, and the Mainframe is 100% loaded! How can a Mainframe CPU replace even a single x86 CPU, let alone 1,500 of them? My point is that you should be cautious about what the famously FUDing company IBM says. The largest IBM Mainframe has 50,000 MIPS, which corresponds to 200,000 MHz of x86 (see the link below). A 10-core x86 CPU at 2GHz equals 20,000 MHz. How many x86 CPUs are needed to reach 50,000 MIPS? So, are the IBM Mainframes as fast as IBM claims, or is it only FUD?
http://www.mail-archive.com/[email protected]/msg18587.html
Another example of IBM FUD is this. The POWER7 has few but strong cores; the SPARC T5 has many but weaker cores. IBM says: the POWER7 core is faster – this is true in some benchmarks. Therefore the POWER7 CPU is faster – this is false. Just because IBM has stronger cores does not make the entire CPU stronger. Imagine IBM had only one strong core, and SPARC had 1000 slightly weaker cores; which CPU is fastest, you reckon?
IBM claims that Oracle’s SPARC T5 world records are false, that the SPARC T5 is not the world’s fastest CPU. Oracle is lying, says IBM. Nothing is correct, says IBM:
http://whywebsphere.com/2013/04/29/weblogic-12c-on-oracle-sparc-t5-…
“….Oracle announced their new SPARC T5 processor with much fanfare and claiming it to be the “fastest processor in the world”. Well, perhaps it is the fastest processor that Oracle has produced, but certainly not the fastest in the world. You see, when you publish industry benchmarks, people may actually compare your results to other vendor’s results. This is exactly what I would like to do in this article.
…Being “fastest processor in the world” means that such processor must be able to handle the most transactions per second per processor core
…However IBM produced the world record result in terms of EjOPS per processor core – truly a measure of the fastest processor known to men…”
Here IBM claims that because it has stronger cores, it beats the SPARC T5 CPU. Well, the T5 has twice the number of cores of the POWER7, so the T5 CPU as a whole is actually faster. So IBM is FUDing again.
https://blogs.oracle.com/jhenning/entry/ibm_transactions_per_core
And actually, the SPARC T5 core is faster too: in TPC-C and TPC-H benchmarks, for instance, the T5 core is 24% faster than the IBM POWER7 core.
Kebabbert,
Wow, that’s a long post, I’ll just respond to the main point.
“My point is that when IBM cannot compete, IBM will start to FUD. As IBM always has done.”
I wasn’t denying that bias and FUD exist on the part of IBM; surely they do… however, I’d take all statements with a pinch of salt, whether they’re from IBM, Oracle, HP, SGI or anybody else.
I’ll always trust third-party benchmarks over the “horse’s mouth”. Unfortunately, with Solaris, Oracle is actually guilty of banning third parties from publishing Solaris benchmarks. While this by itself doesn’t imply that their stuff performs badly, it does imply that Oracle wants to hide the truth and that we only get to see benchmarks of their choosing.
http://www.phoronix.com/scan.php?page=news_item&px=ODc4Nw
You mean to say x86 does not scale beyond 8 sockets, not 8 cores. Sure, there are no x86 SMP servers larger than 8 sockets today, and there never have been. The SGI UV1000 is actually a NUMA cluster with 100s of x86 sockets, but it is an HPC cluster, so it is ruled out of this discussion, because we are discussing SMP servers, not HPC clusters.
Regarding the scalability of Linux: if you look at SAP benchmarks using 2.8GHz Opteron CPUs and fast RAM sticks, Linux gets 87% CPU utilization on an 8-socket server. That is quite bad. Solaris, on the same Opteron family but slower 2.6GHz parts and slower RAM sticks, gets 99% CPU utilization and beats Linux on the SAP benchmarks. Solaris uses slower hardware and beats Linux. 8 sockets is the maximum Linux has been tested at, and Linux does not handle 8 sockets well. Big Tux, the 64-socket Linux experiment, had ~40% CPU utilization in HP’s experimental benchmarks, so HP could not sell it. So with 8 sockets Linux had 87% CPU utilization, and at 64 sockets ~40%. I guess at 16 sockets Linux would have about 70% CPU utilization and rapidly fall off from there, because Linux has not been tested or optimized for 16 sockets – how could Linux scale to 16 sockets well when the hardware does not exist?
Yes, it will run. But how well? HP tried Linux on their 64-socket server, with awful results. I believe Linux will stutter and be very bad on 96-socket SPARC servers.
There are benchmarks with Linux and Solaris on the same or similar x86 hardware. On a few CPUs, Linux tends to win. On larger configurations, Solaris wins. But that is expected, because all Linux kernel devs sit with 1-2 socket PCs at home. Not many Linux devs have access to 8-socket servers. Linux vs Solaris on same hardware:
https://blogs.oracle.com/jimlaurent/entry/solaris_11_outperforms_rhe…
https://blogs.oracle.com/jimlaurent/entry/solaris_11_provides_smooth…
High-end Linux deployments do not exist, and never have. So for high-end deployments, you have no choice but to go to Unix servers with 32 sockets or more, from IBM, Oracle or HP. I would be very, very surprised if Solaris were not better than Linux there. Unix kernel devs have for decades tested and tailored their kernels for 32 sockets and above – of course Unix must handle large servers better?
People have routinely run Unix on large 32-socket servers, or larger, for decades. So Unix should be comfortable running large servers without effort, I suspect.
The canonical example of a large SMP workload is running databases in large configurations. Here a kernel developer explains and talks about NUMA, SMP, HPC, etc.:
http://gdamore.blogspot.se/2010/02/scalability-fud.html
“….First, one must consider the typical environment and problems that are dealt with in the HPC arena. In HPC (High Performance Computing), scientific problems are considered that are usually fully compute bound. That is to say, they spend a huge majority of their time in “user” and only a minuscule tiny amount of time in “sys”. I’d expect to find very, very few calls to inter-thread synchronization (like mutex locking) in such applications…
…Consider a massive non-clustered database. (Note that these days many databases are designed for clustered operation.) In this situation, there will be some kind of central coordinator for locking and table access, and such, plus a vast number of I/O operations to storage, and a vast number of hits against common memory. These kinds of systems spend a lot more time doing work in the operating system kernel. This situation is going to exercise the kernel a lot more fully, and give a much truer picture of “kernel scalability”
Kebabbert,
“You mean to say x86 does not scale beyond 8 sockets, not 8 cores. Sure, there are no x86 SMP servers larger than 8 sockets today, and there never have been.”
Actually I meant this in the context of SMP versus NUMA. You said “All 32 socket Unix servers share some NUMA features, but they have very good RAM latency, so you treat them all as a true SMP server”. I’d really like to know the difference between x86 NUMA and “Unix server true SMP”, since as far as I know SMP requires NUMA in order to scale efficiently above 4-8 cores without very high memory contention. Saying that Solaris servers are different sounds an awful lot like marketing speak, but maybe I’m wrong. Can you point out a tangible technical difference?
“There are benchmarks with Linux and Solaris on the same or similar x86 hardware. On a few CPUs, Linux tends to win. On larger configurations, Solaris wins.”
“Linux vs Solaris on same hardware:”
I thank you for looking these up. I really wish they were using *identical* hardware and only switching a single variable between tests (instead of switching the OS AND the hardware vendor).
Still, the benchmarks are interesting.
This chart shows a glaring scalability problem with RHL – or rather, we’re left to infer one by comparison with the Solaris chart on the same page.
http://blogs.oracle.com/jimlaurent/resource/HPDL980Chart.jpg
However another chart on a different blog post (on different hardware) doesn’t show the scalability problem under RHL.
http://blogs.oracle.com/jimlaurent/resource/HPML350Chart.jpg
So what was the problem with Red Hat Linux? Was it the hardware, the OS, the software, the number of cores? We really don’t know. Surely any employee worth his salt would have performed the benchmarks in an apples-to-apples hardware/software configuration; why weren’t those results posted?
As before, I’m not asserting that Solaris isn’t better, it very well may be, but it would be naive to trust Oracle sources at face value.
“Consider a massive non-clustered database. In this situation, there will be some kind of central coordinator for locking and table access, and such, plus a vast number of I/O operations to storage, and a vast number of hits against common memory.”
I’d think this design is suboptimal for scalability. A scalable design would NOT have a single central coordinator; there should be many (i.e. one per table or shard) running in parallel even though it’s not clustered. To be optimal on NUMA, software should be specifically coded for it; however, you are probably right that vendors are treating the nodes as clustered instead. They haven’t gotten around to rewriting the database engines to take advantage of NUMA specifically.
Can you disclose whether you are connected to oracle?
Here is some information on these “different Solaris servers”. I mean they are different, because they are well built, minimizing memory latency. Look at the last picture at the bottom:
http://www.theregister.co.uk/2012/09/04/oracle_sparc_t5_processor/
“…Turullols said you need one hop between sockets to scale. It usually takes two hops to get to eight-way NUMA given current designs, so this is where that near linear scalability is coming from….”
You see that in this SPARC T5 8-socket server, every CPU is connected to every other CPU via 28 lanes, and a CPU can reach any memory cell in at most one hop – which means latency is very low. This is the reason this 8-socket server scales linearly. There are many 8-socket servers where you only get the equivalent of 5 CPUs out of 8, or so; they scale badly because of the many hops.
Now imagine the Oracle M6 server: with all 96 CPUs connected to each other, there would be 4560 lanes. That is too many, and messy. So how do you build a 96-socket server that scales well? Look at the bottom picture on the coming M6 server:
http://www.theregister.co.uk/2013/08/28/oracle_sparc_m6_bixby_inter…
Bronek:”…If you look at the picture carefully you will find that all CPUs can connect to others directly (7 cores), via single BX (12 cores) or a BX and a CPU (i.e. single hop, remaining 12 cores). This all with 4Tb/s bandwidth to maintain cache coherency across sockets – I think that’s some really nice engineering…”
So, this M6 server has all CPUs connected to one another in only a few hops in the worst case. It looks like the latency will be a few hundred ns at worst. An HPC cluster, on the other hand, has a worst-case latency of 10,000 ns – which makes it usable only for parallel workloads where you don’t need to access data far away.
This M6 server is for running huge non-clustered database configurations, all from memory. Oracle is concerned about the SAP HANA in-memory database, and this is Oracle’s answer: a huge SMP-like server capable of running everything from RAM. So SAP’s HANA RAM database is not a threat to Oracle’s database. Thinks Larry Ellison.
This M6 server is very intricately built, as we can all see. There are no vendors other than Oracle and Fujitsu (with the new 64-socket SPARC64 M10-4S) building large database SMP servers with more than 32 sockets, as far as I know. HP has a 64-socket server, but it is old and not updated; I don’t know if it is sold any longer.
Anyway, you will not see a Linux NUMA cluster server running non-clustered databases.
Here are the 8-socket Solaris vs Linux SAP benchmarks I talked of. They use very similar hardware – Opteron CPUs of almost the same model, but the Linux box uses higher-clocked ones. Linux has 128GB RAM and Solaris 256GB, because the HP team benchmarking Linux wanted to use faster RAM sticks, so they had to settle for 128GB. Solaris uses slower memory sticks.
download.sap.com/download.epd?context=B1FEF26EB0CC34664FC7E80B933FCCAC80DD88CBFAF48C8D126FB65D80D09E988311DE75E0922A14
download.sap.com/download.epd?context=40E2D9D5E00EEF7CCDB0588464276DE2F0B2EC7F6C1CB666ECFCA652F4AD1B4C
There is not the same scalability problem, but there are other problems: the Linux graph is very jagged, not smooth. Linux struggles with the workload and is very stuttery. Solaris is not.
Linux has never been tested on servers larger than 8 sockets, so I would be very surprised if Linux could scale well. But yes, I agree you need to be careful with Oracle marketing, too. I prefer independent benchmarks. If they don’t exist, we can do nothing. But still, the Oracle benchmarks show huge performance advantages over any other CPU or OS. I expect Oracle could tweak benchmarks slightly, but not completely? It should not be possible to make a lousy CPU look great? Or?
Sure. I am not connected to Oracle in any way. I work in finance, not IT. I just happen to be a geek who likes the best tech out there. I admire good tech. I liked the IBM POWER7 when it was released, because it was the best back then, better than SPARC. And I said so, in posts, yes – I acknowledged the superiority of POWER7 back then. I am also a fan of Plan9; in my opinion it might be the most innovative OS of them all. I prefer Go to Java, etc. I just like the best tech; it does not really matter who is doing it. OpenBSD for security. Solaris for being the most innovative Unix. SPARC for the fastest CPU. ZFS for the safest filesystem. Etc. If Btrfs were better than ZFS, I would switch and not look back. I am pragmatic; I prefer the best tech.
But I don’t like lies and FUD. To that I react, and I want to dispel FUD.
-IBM Mainframes have very weak CPUs, they are not strong. No matter what IBM says.
-Linux scales quite badly. No matter what Linus Torvalds says.
-Linux code quality is not optimal. Which Torvalds and other kernel devs agree on. Here is what Con Kolivas, the famous Linux kernel developer, says when he compares the source code quality of Solaris to Linux:
http://ck-hack.blogspot.se/2010/10/other-schedulers-illumos.html
http://www.forbes.com/2005/06/16/linux-bsd-unix-cz_dl_0616theo.html
http://www.theregister.co.uk/2009/09/22/linus_torvalds_linux_bloate…
Kebabbert,
“Linux has never been tested on servers larger than 8 sockets, so I would be very surprised if Linux could scale well.”
“But I don’t like lies and FUD. To that I react, and I want to dispel FUD.”
“-IBM Mainframes have very weak CPUs, they are not strong. No matter what IBM says.”
So you admit that you’ve never seen the data? Neither have I. This is the problem; everything you or I say is mere speculation.
Surely it must have been tested by Oracle in an apples-to-apples comparison, but we just aren’t allowed to see the results they don’t approve of. Off the top of my head, VMware does the same thing, and it’s annoying as hell that their marketing heads say one thing and third parties aren’t even allowed to publish contradictory evidence.
I use and like Oracle’s database products – they’re top notch – but censoring benchmarks sure is fishy. You are clearly taking everything Oracle says at face value, and provided that I were to take everything they said at face value, then you are right: Linux scales poorly.
However I frankly wouldn’t put it above them to employ the same marketing FUD and bias that you are accusing their competitors of doing. Understand that I’m just trying to explain why I’m skeptical, not trying to persuade you that you are wrong. Like you, I just don’t have the data that would settle the question in factual terms.
“-Linux scales quite badly. No matter what Linus Torvalds says.”
“-Linux code quality is not optimal. Which Torvalds and other kernel devs agree on.”
I actually agree with you, the code quality isn’t great and IMHO the kernel abstractions are poor. Linus refuses to have stable API/ABIs and consequently individual pieces of code are rarely stable even if they don’t have bugs in them. I have many gripes with linux and am not a blind fanboy, however none of this actually speaks against linux scalability on SMP.
In order to form my opinion on the matter, I’d want more impartial data, and ideally some evidence that the system is correctly configured.
I am not really accepting everything that Oracle says. But I “trust” Oracle more than IBM, because IBM has time and again been proven to spread pure FUD; they started the FUD thing in the first place. IBM = masters of FUD.
I trust benchmarks more than subjective marketing slogans. I even mostly accept IBM benchmarks. I want to see hard numbers, benchmarks.
Oracle actually has nothing against Linux. Instead, Oracle promotes Linux and bets heavily on Linux too. So Oracle does not dispel the Linux FUD – I am doing that. Oracle does not care, as long as they sell and earn money. However, the margins are better in high-end servers costing millions of USD.
Regarding the non-scalability of Linux: even if you don’t agree that Linux scales badly – do you agree that >8-socket Linux servers have never existed? A) No one sells such large Linux servers – do you agree with this?
B) If you agree with that, do you agree that Linux has not been tested or optimized on >8-socket servers?
C) If you agree with B), do you agree that it is highly probable that Linux scales badly, because no kernel developer has ever tweaked Linux into many-socket territory?
Which of A), B) and C) do you agree with, and which do you disagree with? If you agree with all three, then we both agree that it is “highly probable that Linux scales badly”, right? Not “Linux scales badly”, but “highly probable that it scales badly”, right?
I have no proof of this, but there are benchmarks on similar hardware where Linux stutters and behaves badly. And HP benchmarks showed awful performance on a 64-socket server. Sure, there might be some workloads that Linux actually handles OK, but in general it is highly probable that it scales badly.
So, do we both agree on “highly probable that Linux scales badly”? Or do you prefer “probable that Linux scales badly”? Or neither?
PS. I wonder, for these SGI UV1000 NUMA servers, are developers always using MPI and similar cluster libraries when developing software for them? Must you always use MPI when developing for SGI NUMA servers? (I am convinced the SMP-like Solaris servers are not using MPI. I am convinced they just copy the Solaris binaries to the SMP servers, without rewriting them.)
Do you mean 16 or 32 individual CPU sockets on a single motherboard? Or do you mean 16 or 32 individual CPU cores spread across 4 (or more) physical CPUs in separate physical sockets?
If you mean the former, sure, I’ve never heard of one of those, running any OS. They probably exist, I’ve just never seen anything online or IRL with more than 4 physical CPU sockets.
If you mean the latter, you’re talking out your arse. We have 16-core servers running Linux right here in our data centre (Debian Linux + KVM running on a dual-socket mobo with 8 cores per socket). And SuperMicro makes motherboards that support 4 physical sockets, with 16 cores per socket (AMD, so no hyperthreading crap), for a total of 64 CPU cores … all fully supported by Linux.
Perhaps you should clarify which you mean (physical CPUs or CPU cores).
In a socket there is only one CPU, yes? You don’t put two or three CPUs into one socket, no? Sure, a CPU might have 4 cores or so, but still, in a socket there is only one CPU. There is a bijection between CPUs and sockets.
So, yes, I mean 32-socket Linux servers, not 32-core Linux servers. There have never been 32-socket Linux servers for sale. If anyone can find such a large Linux server, please post it here. Then I will stop saying this.
SMP does not scale to any appreciable core count. In fact, on individual CPUs with large core counts (like the 16-core Interlagos Opterons) you will find that certain cores will be in one memory region, and other cores will be in other regions, and you need to concern yourself with this fact if you want to write code that makes optimal use of the chip. I also fail to see how something can have “features of NUMA”. Either all CPUs can access memory at a latency that is unaffected by which address you are accessing, or the latency/access time is dependent on which region of memory you access, and it’s a NUMA machine. You said yourself that these Sparc “SMP” machines have different latencies depending on which memory region you access, which would imply that you do need to worry about NUMA issues on one of these machines if you really wish to get top performance.
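On Linux, for what it’s worth, numactl is the usual way to inspect those memory regions and pin work to them; a quick illustration:

numactl --hardware                          # show the nodes, their CPUs and memory
numactl --cpunodebind=0 --membind=0 ./app   # keep a process and its allocations on node 0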
Also, given that these are hardware/architectural issues, I fail to see what the choice of operating system has to do with it, or what exactly prevents Linux from being useful in this scenario. The choice of Solaris probably has more to do with vendor lock-in and the fact that Solaris has been optimized for SPARC since before Linux even existed…
Yes, all modern SMP servers are somewhat NUMA; read my post above on the new Oracle M6 server. The point is that there is a huge difference in programming when the worst-case memory latency is a few hundred ns versus 10,000 ns. The latter must be treated as a cluster.
Benchmarks show that Linux has problems utilizing all CPUs well on as few as 8 sockets. Which is to be expected, because no Linux servers larger than 8 sockets exist; 8-socket servers are on the verge of what Linux can handle. Or do you mean that Linux would scale well even on 32-socket servers, when it has not even been tested or optimized for such servers?
So we all agree that Linux cannot scale well, when it stops at 8 sockets?
Kebabbert,
“So we all agree that Linux cannot scale well, when it stops at 8 sockets?”
Absolutely not. You really shouldn’t expect to convince others with this kind of logic. In the absence of evidence, I could say anything I wanted, but it doesn’t make it true.
I can appreciate your opinion and insights, but you go too far when you try asserting things as fact when you admit that you haven’t seen any data.
Here’s a benchmark for PostgreSQL 9.2 that shows almost linear scaling up to 64 cores with Linux 3.2. It’s not Solaris/Oracle, but it does prove that Linux can scale for well-tuned databases (note the major bottleneck with PostgreSQL 9.1).
http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.h…
How would linux do on a 64 *socket* server, I don’t know, but I’m sure as heck not going to state my prediction as a fact.