Pinderkent

Pain and glory from the trenches of the IT world.

NoSQL, the next big mess we'll get to clean up.

Posted on Sunday, April 04, 2010 at 9:56 PM.

Over the past couple of years, we've been hearing more and more about the so-called "NoSQL" movement. In short, its adherents advocate the use of various data management systems that do away with some of the features that most relational database systems have come to offer, in favor of supposedly offering better performance for large data sets. The hype has become particularly strong lately, with it being revealed less than a month ago that Digg has started using Cassandra heavily. A few days later, a similar announcement was made regarding reddit.

There has been a fair amount of discussion regarding this topic. Dennis Forbes, for instance, discusses Digg's transition, and explains how properly using some of the most integral features of virtually all existing relational database systems, along with solid-state drives, can help alleviate many performance issues. We've also seen Ted Dziuba write about the risk and unnecessity of NoSQL-esque approaches for most situations, while Royans Tharakan has suggested the opposite to be true. Jeremy Zawodny describes NoSQL as "software Darwinism".

Regardless of how one personally feels about NoSQL or relational database systems, I can think of a few things that will likely hold true:

  1. NoSQL techniques and systems will continue to get the sort of hype that misleads many developers and managers into thinking it's an approach that's much better than it actually is.
  2. Numerous existing software systems currently using relational databases very successfully will be transitioned to using using a NoSQL approach.
  3. Many new software systems will use NoSQL technologies, especially when it isn't necessary or even suitable to do so.
  4. These new and modified systems will fail horribly. Expected performance gains won't materialize, data will be lost or badly mangled due to NoSQL's general lack of focus on data consistency, and codebases will be ruined by these transitions.

This is a mixed blessing. On one hand, it will ensure a lot of work for those of us who often get called in to deal with software blunders. But this isn't truly productive work, of course. It's mainly just fixing mistakes that shouldn't have been made in the first place. Techniques and software that worked for NoSQL users like Facebook, Google, Digg or reddit just won't work across the board, and it's quite unfortunate that so many developers and development managers won't realize this until it's far too late.

Permalink: http://pinderkent.phumblog.com/post/2010/04/nosql_the_next_big_mess_well_get_to_clean_up
Share:

Analyzing existing databases and their relationships with applications.

Posted on Saturday, October 17, 2009 at 3:24 AM.

Anybody working on business applications these days will undoubtedly have to familiarize himself or herself with one or more existing databases. These databases have often been "grown" rather than designed in any meaningful way, and thus will be littered with unused tables, unused functions or stored procedures, missing constraints, poor normalization, and a host of other problems.

Developers in such a situation will often look for an easy way out, such as the use of tools to automatically reverse-engineer various parts of the existing database or databases. While these tools can be useful, I don't think they are ever a replacement for just stepping through the code, line-by-line, and observing exactly what queries are executed.

Depending on the application, it may even be a bad idea to try and think of the application and database as separate. Many times we find that one can't exist without the other, and vice versa. For instance, we find hard-coded SQL queries within the application code written in languages like Java, PHP or C#. In such situations, one literally has to take a debugger and step through the application code in order to get even a basic understanding of how the system works.

Another thing that may be worth avoiding is trying to understand the system all at once. Often, it will take many months to truly grasp even a moderately sized application and the database behind it. As changes are made or bugs are fixed, take a moment or two to study and document the code paths that are involved. Doing this on a daily basis will eventually expose a developer to large portions of the software system they're working with.

Regardless of the approach, one thing to keep in mind that it's not an easy task becoming familiar with existing codebases and databases, especially when they're as ugly as so many real-world systems are. Give it time, remain patient, and eventually the system will start to feel much smaller and manageable.

Permalink: http://pinderkent.phumblog.com/post/2009/10/analyzing_existing_databases_and_their_relationships_with_applications
Share:

Losing developer time to performance problems hidden by high-level languages.

Posted on Saturday, May 23, 2009 at 11:48 PM.

One of the main purposes of high-level programming languages is to save developer time by abstracting away the onerous and tedious aspects of the underlying hardware. In general, most high-level languages tend to do a good job at this. Unfortunately, we see these same high-level languages wasting significant amounts of developer time. Many times, this is due to performance problems. What becomes problematic, however, is that in order to properly diagnose and fix many of these performance problems, the developers involved need to obtain a high degree of understanding about the implementation of the high-level language that's involved.

A good example of this is a performance issue described recently with IronPython, an implementation of Python for Microsoft's .NET platform. In short, a very innocuous line of code was apparently responsible for the poor performance.

This incident highlights several main problems. The first is that high-level code can lead to some very unexpected interactions within the high-level language's implementation. This can obviously cause problems by misleading the developer or developers dealing with the performance problems. What appears on the surface to be a simple and likely very fast operation ends up being the culprit. A lot of developer time can be spent looking in the wrong places.

The second concern is that tracking down the problem requires in-depth knowledge about the high-level language's implementation. To some extent, we use such high-level languages in the first place to avoid needing to acquire such lower-level knowledge. We want to focus on the application we're writing, not on dealing with issues pertaining to the platform we're building upon. Time spent learning about the high-level language's implementation is time not spent on developing the application at hand.

This particular situation seems to have had a "happy" ending. The victim of the poor performance got a rapid response from somebody who did have inside knowledge about IronPython's implementation. Unfortunately, this isn't always the case. I've seen far too many times when developers have spun their wheels trying to track down obscure performance problems of that type. And it isn't a problem associated just with programming languages like Python, Ruby, or Perl, either. We often see it happen with SQL. A minor change to a query can result in a huge performance gain or loss.

As we start using high-level programming language implementations like IronPython, Scala, Clojure and JRuby, which are themselves often implemented in high-level programming languages like Java or C#, which in turn run on some sort of a virtual machine, we'll run into these sorts of problems more and more frequently. Each additional layer of software abstraction that we add in makes the situation more and more difficult. Soon we may need to look in two or three very different layers of software, assuming we even have source access, to track down performance issues. This could very well lead to a serious waste of developer time and effort.

Permalink: http://pinderkent.phumblog.com/post/2009/05/losing_developer_time_to_performance_problems_hidden_by_highlevel_languages
Share:

A lack of good documentation is one of the main problems with Firebird.

Posted on Wednesday, April 29, 2009 at 3:34 AM.

The recent Why so few developers are using Firebird SQL? article has generated a fair amount of discussion, including some at Reddit. For those who might not be aware, Firebird is an open source RDBMS based on the InterBase 6.0 source code that Borland released just under a decade ago. Since then, it has been under continuous development, but really hasn't caught on like other open source databases such as PostgreSQL, MySQL and SQLite have.

About a year ago I was working with a company who was considering the use of Firebird for a new in-house application they were developing. I wasn't directly involved with this particular project, but did talk to some of the developers working on it. Having used InterBase years ago while working on some Delphi-based software, and having heard of Firebird, I was interested in seeing what they had to say about it. While they didn't have many technical complaints, there were a few factors that resulted in them opting to use PostgreSQL instead.

Technically, InterBase and Firebird aren't bad database systems by any means. They do offer exactly what's needed and what's expected by many users. My experience with InterBase years back was that its performance and reliability were quite suitable, and I don't have any reason to think the situation would be any different now. Personally, I would entrust Firebird with valuable data and availability over MySQL. The project I mentioned earlier basically had the same opinion, as I recall. Their complaints weren't of a technical nature.

One significant complaint they did have was with Firebird's documentation. When they looked at it, it was basically the InterBase 6.0 manuals with separate documentation outlining the changes and additions made by the Firebird developers. Checking the Firebird documentation page now, about a year later, it seems that it is still a combination of the InterBase 6.0 manuals and the Firebird 2.0 Language Reference Update document.

Not all of the developers working on the project had used InterBase 6.0, and facing relatively tight deadlines, they didn't expect to be able to get everyone up to speed fast enough if they had to become familiar with InterBase 6.0 first, and then "patch" that knowledge with the Firebird updates. One major benefit of PostgreSQL is that it offers very comprehensive and accessible documentation online. The vast majority of questions and issues can be resolved by referring to the appropriate section of their documentation.

The developers and managers of the project I mentioned earlier also felt more comfortable with the community and development processes around PostgreSQL. I'm not sure what sort of research they did to come to this conclusion, but I recall them saying that they thought the PostgreSQL developer and user community was more "stable". My interest in their findings was more technically-oriented, so I didn't follow up much with respect to this.

Personally, I'd like to see greater adoption of Firebird. I think it has technical merit, and given its heritage it should be production-ready for many users. Streamlining the documentation might help encourage its adoption. It won't be an easy task by any means, but if the Firebird project could produce and then maintain some documentation on par with that which the PostgreSQL project has put together, the burden on new users would be eased greatly. We may then see more people willing to at least give it a try.

Permalink: http://pinderkent.phumblog.com/post/2009/04/a_lack_of_good_documentation_is_one_of_the_main_problems_with_firebird
Share:

"MySQL or PostgreSQL?" is not the question to ask today.

Posted on Sunday, April 26, 2009 at 12:14 AM.

Recently, there was a thread at Hacker News about whether MySQL or PostgreSQL is a better option. These days, I'm not so sure that's even the right question to be asking.

There are many factors to take into account when choosing a database. For some situations, you basically need the functionality offered by advanced database systems like DB2 and Oracle. But many databases need only a small subset of the functionality offered by the more complex database systems. In such situations, the various open source databases are typically a suitable solution.

If you're working on a smaller-scale system that will nevertheless still need to store larger amounts of data, perform more complex queries and support numerous simultaneous users, then PostgreSQL is clearly the way to go. It has over two decades of maturity behind it, and this shows by its high degree of quality and its rich feature set. Even after spending some time working with higher-end database systems, one doesn't generally feel out of place when using PostgreSQL. It typically offers the features and performance necessary for the majority of database-related tasks.

One of the main complaints regarding PostgreSQL is with respect to its support for replication. I'm not sure if these complaints are with much merit. I've seen several enterprise systems making use of Slony-I for replication, with very acceptable results. The performance wasn't always as good as the DBAs I talked to would have liked, but that alone wasn't enough to drive them to other database systems.

On the other hand, for database applications where the data set is smaller, the hardware or hosting environment is more constrained, or the database will be used mainly in a read-only capacity, then SQLite is often the way to go. For being such a small database system, it proves to be a very practical and flexible solution.

Aside from those using it in an embedded or constrained environment, I've seen several companies who use it to power intranet Web sites and applications that receive a fair amount of traffic and use. Given the power of even low-end server hardware today, SQLite proves to be a very effective replacement for MySQL. The fact that it is server-less reduces setup times to almost nothing, and makes creating backups quite painless.

So when intentionally going with an open source database today, SQLite or PostgreSQL are typically the best candidates to choose, depending on the exact situation at hand. To a large degree, we have even seen them squeeze MySQL out of the picture. For those looking for features, reliability, and performance, PostgreSQL is often a better choice than MySQL. On the low end, SQLite usually offers better resource usage, involves less setup and maintenance effort, is easier to back up, but yet is still capable of handling sizable loads without performance problems arising.

So as I mentioned earlier, I think that the question today should not involve MySQL and PostgreSQL, but should instead involve SQLite and PostgreSQL. Only after we have asked the right question can we start to determine which database system is actually better for the needs at hand.

Permalink: http://pinderkent.phumblog.com/post/2009/04/mysql_or_postgresql_is_not_the_question_to_ask_today
Share:
Feeds
  • RSS 2.0 Feed
  • Atom 2.0 Feed
Tags
Archives