pink cylinders

More than Four

...there's an axis for that.

A nice enhancement in PL/SQL Developer .NEXT

I've been taking a look at the public beta of the next version of PL/SQL Developer. I'm currently using the 8.0 release, choosing not to upgrade to 9.0 when it was released last year. One of the new features Allround Automations added in 9.0 was the ability for one instance of the application to open sessions to multiple databases. I considered this a pretty risky enhancement.

However, in the upcoming 10.0 release, this multi-session feature seems to be refined and more solid. One nice enhancement is the ability to assign a window background color to specific connections.

Here, I am connecting to the QA, UAT, and PROD environments: 

Now there is much less chance of accidently running a data update on the wrong DB! This feature would have been quite useful in 8.0 and earlier, frankly. It's nice to have an obvious reminder of which environment you're looking at.

Data Prospecting

Lisa and I drove into the city last Wednesday to attend the San Francisco SQL Server User Group meeting, to see a presentation of Microsoft's latest Data Mining tools. It was a great presentation, very enlightening.

I confess, I've always been under a misconception about data mining. For reasons I'll explain below, I've had a bias that "data mining" was all about exploring your data and looking for unexpected trends and relationships that perhaps could then be leveraged.

However, at the user group presentation, as William Brown took us through a data mining exercise, several things became clear to me:

In Data Mining projects, the following axioms are important:

  • Know what you're looking for
  • Know how to recognize it when you've found it

William started with the following goal: "Of my database of customers who haven't purchased a bicycle, which of them are likely to buy one?" 

He then used the Microsoft toolset to create a "sieve" through which he could pour the customer database and build a collection of likely bicycle purchasers.

This was possible because he started with a table of customers known to be bicycle purchasers, and used a subset of it to effectively "train" the sieve to identify rows with similar properties. He also used the remainder of the known data set to test the sieve's effectiveness.

This is almost the exact opposite of what I had presumed "Data Mining" was all about, but I stand corrected: "Data Mining" is an excellent description of the process. 

You go into the project knowing what it is you want to find, and you have a test that can be used to prove if you are successful. 

Digression: Arthur Ransome's Pigeon Post

In the book Pigeon Post by Arthur Ransome, a team of child explorers decide to go prospecting for gold on the local hills. They base this decision on an old rumor of a lost gold seam up in the hills. Adventures ensue, of course. It's a great book, with detailed descriptions of the prospecting, ore crushing, smelting, and charcoal burning processes. After an unsuccessful attempt to melt the gold dust into a nugget - it just oxides into a black lump - they know that just finding gold-colored ore isn't enough: They need a test to prove the result to their uncle, Captain Flint, who has given up on fruitless treasure hunts:

"It probably wasn't gold at all," said Susan. "We ought to have made sure."

Dick looked up suddenly. "There is one way we could," he said. "We could try with aqua regia. You know, a chemical test. The book says gold dissolves in aqua regia. If it does, we'd know for sure."

More ore is extracted, and later on Dick performs the test, working in Captain Flint's workshop laboratory. 

Dick's tests show that the ore does indeed dissolve in aqua regia (a mixture of two strong acids), but his joy is short-lived when Captain Flint returns and reviews the results:

"What's this? Chemistry?" asked Captain Flint.

"Gold," said Dick.

"Gold? Don't you go and get interested in the wretched stuff. I've sworn off. Had quite enough wasting time... What have you got in that test tube?"

"Aqua regia," said Dick. "And gold dust. And it's gone all right. Dissolved! Gold in the aqua regia. I was just a bit afraid it might not be gold, after all."

"But, my dear chap," said Captain Flint. "Aqua regia will dissolve almost anything. The point about gold is that it won't dissolve in anything else."

Captain Flint examines the ore and, using his knowledge of the geology of the surrounding hills, comes up with a better test to determine what, exactly, the prospectors have found.

Captain Flint's test reveals the ore to be copper pyrites, which in its way, turns out to be quite as good a discovery as gold would have been, and the adventure is a success after all. 

There's several insights we can pull from this story:

  • Choosing an appropriate "test for success" is critical.
  • Domain knowledge is very important, and the earlier you apply it, the better off you'll be.

But here's another thing: The prospectors weren't looking for copper. They were looking for gold. And without some help, they weren't ever going to understand what they had actually found.

This last point makes me think that there is something missing in the current Data Mining toolset.

Reductio ad memoria 

Over a decade ago, I worked on a project with some truly kick-ass data visualization tools: Silicon Graphic's MineSet. The project involved showing how a company could mix their own customer data with additional data sourced from a third-party, and throw it into a 3D visualization, allowing variation along five or more axes simultaneously.  

For example, in addition to X, Y, and Z axis, we also had color, density, size of the "splat", and probably some others that I can't recall. The thrill of taking this 3D model and spinning it around various viewing angles to see correlations between data axes - in real-time, in a web browser! - was, frankly, mind-blowing at the time.

Now, in my opinion the project's premise was flawed, because the nature of the third-party data didn't really allow for a 1:1 join with the primary data set - it was more of an exercise in psychohistory.

But the tool was brilliant. It even had an Internet Explorer ActiveX viewer control for browsers.  

One feature I would have liked to have seen was a "randomize axes" option, where you could cycle through the various available columns in your data set, mapping them to different aspects of the 3D view, and then spin the result around, looking for unexpected relationships.

Because it's the unexpected that is truly insightful. I want to see the copper pyrites - or whatever else is in there - without having to explicitly test for it.

I don't think this is Data Mining as it is understood today. But perhaps it might be called "Data Prospecting".

Finally

A quick web search indicates that, in October 2003, the SGI MineSet product was licensed for development and sale to Purple Insight, a Gloucester, UK-based company that now appears to be defunct. This is a shame, because these visualization tools were unsurpassed by anything I've seen to date. 

Lisa tells me that some of the underlying algorithms found their way into IBM's Many Eyes project

Note: By the way, all of Arthur Ransome's books about the young explorers are great, and I highly recommend them, for children of all ages.

Test Post

Normal text

[code:c#]select * from SYS.ALL_OBJECTS where OBJECT_TYPE='PACKAGE';[/code]

Normal text

[code:vb]select * from SYS.ALL_OBJECTS where OBJECT_TYPE='PACKAGE';[/code]

Normal text

 

Avoiding artificial inflation in SUM on a denormalized column value

Say you have a data model that looks like this:

In other words, time entries can be tagged against multiple tasks. (I know this seems wrong, but trust me, it made sense in the context of the system.)

The problem we had was that the complicated query behind the report of "total hours by task" was counting the hours twice or more, for any entry tagged to more than one task.

[code:c#]select T.NAME, sum(L.HOURS) 
 from time_x_task X 
  join timelog L on L.ID=X.TIME_ID
  join task T on T.ID=X.TASK_ID
  group by T.NAME; [/code]

After thinking about how I could alter the query so that each timelog.HOURS was included only once in the SUM(), it occurred to me that it might be better to just distribute the hours across the tasks:

[code:c#]select L.ID, T.NAME, L.HOURS, L2.cnt, (L.HOURS/L2.CNT) as ACTUAL_HRS
 from time_x_task X
 join task T on T.ID=X.TASK_ID
 join timelog L on L.ID=X.TIME_ID
 join (select count(1) as CNT, TIME_ID 
         from time_x_task 
         group by TIME_ID) L2 on L2.TIME_ID=L.ID;[/code]

Now we SUM(ACTUAL_HRS) instead of sum(HOURS).

OK, this is a contrived example, but the point I'm trying to make is that sometimes you have to throw away the original question and think about what you're really trying to achieve. 

A light-weight SQL editor plugin for Eclipse IDE

Eclipse is not my primary IDE, but I do find it very useful for editing ANT project definitions, especially the outline view.

I use generic projects in Eclipse Juno 4.2 to manage collections of DDL scripts, but I did miss SQL syntax highlighting. Well, no more! 

I found this plugin on the Internet: http://blog.debertshaeuser.com/index.php?page=eclipse-sql-editor.

Just download the jar and put it in the eclipse/plugins folder, then re-start Eclipse. It works great:

Long columns wrapping in the SQL Result Grid

Have you ever experienced a moment when you find that an application you've used for years has a feature that, if only you'd known about it, would have saved you time over and over again? It's a mixed feeling... one one hand, it's "Yay, this software is awesome!" and the other it's "Doh! I'm an idiot!". 

Well, PL/SQL Developer just handed me another of those moments. 

In the result grid of the SQL Query window, you can resize any row by grabbing the separator line in the margin and pulling down. I think I knew that... but what didn't realize is that any column with a long string will then wrap into the extra space, like this:

I think that is pretty cool. And it's been in the product since at least version 8 (which is what I've used for the last few years).