Areas of Interest, as counted by my cat

Author: colin (Page 7 of 13)

Data Prospecting

Lisa and I drove into the city last Wednesday to attend the San Francisco SQL Server User Group meeting, to see a presentation of Microsoft’s latest Data Mining tools. It was a great presentation, very enlightening.

I confess, I’ve always been under a misconception about data mining. For reasons I’ll explain below, I’ve had a bias that “data mining” was all about exploring your data and looking for unexpected trends and relationships that perhaps could then be leveraged.

However, at the user group presentation, as William Brown took us through a data mining exercise, several things became clear to me:

In Data Mining projects, the following axioms are important:

  • Know what you’re looking for
  • Know how to recognize it when you’ve found it

William started with the following goal: “Of my database of customers who haven’t purchased a bicycle, which of them are likely to buy one?”

He then used the Microsoft toolset to create a “sieve” through which he could pour the customer database and build a collection of likely bicycle purchasers.

This was possible because he started with a table of customers known to be bicycle purchasers, and used a subset of it to effectively “train” the sieve to identify rows with similar properties. He also used the remainder of the known data set to test the sieve’s effectiveness.

This is almost the exact opposite of what I had presumed “Data Mining” was all about, but I stand corrected: “Data Mining” is an excellent description of the process.

You go into the project knowing what it is you want to find, and you have a test that can be used to prove if you are successful.

Digression: Arthur Ransome’s Pigeon Post

In the book Pigeon Post by Arthur Ransome, a team of child explorers decide to go prospecting for gold on the local hills. They base this decision on an old rumor of a lost gold seam up in the hills. Adventures ensue, of course. It’s a great book, with detailed descriptions of the prospecting, ore crushing, smelting, and charcoal burning processes. After an unsuccessful attempt to melt the gold dust into a nugget – it just oxides into a black lump – they know that just finding gold-colored ore isn’t enough: They need a test to prove the result to their uncle, Captain Flint, who has given up on fruitless treasure hunts:

“It probably wasn’t gold at all,” said Susan. “We ought to have made sure.”

Dick looked up suddenly. “There is one way we could,” he said. “We could try with aqua regia. You know, a chemical test. The book says gold dissolves in aqua regia. If it does, we’d know for sure.”

More ore is extracted, and later on Dick performs the test, working in Captain Flint’s workshop laboratory.

Dick’s tests show that the ore does indeed dissolve in aqua regia (a mixture of two strong acids), but his joy is short-lived when Captain Flint returns and reviews the results:

“What’s this? Chemistry?” asked Captain Flint.

“Gold,” said Dick.

“Gold? Don’t you go and get interested in the wretched stuff. I’ve sworn off. Had quite enough wasting time… What have you got in that test tube?”

Aqua regia,” said Dick. “And gold dust. And it’s gone all right. Dissolved! Gold in the aqua regia. I was just a bit afraid it might not be gold, after all.”

“But, my dear chap,” said Captain Flint. “Aqua regia will dissolve almost anything. The point about gold is that it won’t dissolve in anything else.”

Captain Flint examines the ore and, using his knowledge of the geology of the surrounding hills, comes up with a better test to determine what, exactly, the prospectors have found.

Captain Flint’s test reveals the ore to be copper pyrites, which in its way, turns out to be quite as good a discovery as gold would have been, and the adventure is a success after all.

There’s several insights we can pull from this story:

  • Choosing an appropriate “test for success” is critical.
  • Domain knowledge is very important, and the earlier you apply it, the better off you’ll be.

But here’s another thing: The prospectors weren’t looking for copper. They were looking for gold. And without some help, they weren’t ever going to understand what they had actually found.

This last point makes me think that there is something missing in the current Data Mining toolset.

Reductio ad memoria

Over a decade ago, I worked on a project with some truly kick-ass data visualization tools: Silicon Graphic’s MineSet. The project involved showing how a company could mix their own customer data with additional data sourced from a third-party, and throw it into a 3D visualization, allowing variation along five or more axes simultaneously.

For example, in addition to X, Y, and Z axis, we also had color, density, size of the “splat”, and probably some others that I can’t recall. The thrill of taking this 3D model and spinning it around various viewing angles to see correlations between data axes – in real-time, in a web browser! – was, frankly, mind-blowing at the time.

Now, in my opinion the project’s premise was flawed, because the nature of the third-party data didn’t really allow for a 1:1 join with the primary data set – it was more of an exercise in psychohistory.

But the tool was brilliant. It even had an Internet Explorer ActiveX viewer control for browsers.

One feature I would have liked to have seen was a “randomize axes” option, where you could cycle through the various available columns in your data set, mapping them to different aspects of the 3D view, and then spin the result around, looking for unexpected relationships.

Because it’s the unexpected that is truly insightful. I want to see the copper pyrites – or whatever else is in there – without having to explicitly test for it.

I don’t think this is Data Mining as it is understood today. But perhaps it might be called “Data Prospecting”.

Finally

A quick web search indicates that, in October 2003, the SGI MineSet product was licensed for development and sale to Purple Insight, a Gloucester, UK-based company that now appears to be defunct. This is a shame, because these visualization tools were unsurpassed by anything I’ve seen to date.

Lisa tells me that some of the underlying algorithms found their way into IBM’s Many Eyes project.

Note: By the way, all of Arthur Ransome‘s books about the young explorers are great, and I highly recommend them, for children of all ages.

Avoiding artificial inflation in SUM on a denormalized column value

Say you have a data model that looks like this:

In other words, time entries can be tagged against multiple tasks. (I know this seems wrong, but trust me, it made sense in the context of the system.)

The problem we had was that the complicated query behind the report of “total hours by task” was counting the hours twice or more, for any entry tagged to more than one task.

[code:c#]select T.NAME, sum(L.HOURS)
from time_x_task X
join timelog L on L.ID=X.TIME_ID
join task T on T.ID=X.TASK_ID
group by T.NAME; [/code]

After thinking about how I could alter the query so that each timelog.HOURS was included only once in the SUM(), it occurred to me that it might be better to just distribute the hours across the tasks:

[code:c#]select L.ID, T.NAME, L.HOURS, L2.cnt, (L.HOURS/L2.CNT) as ACTUAL_HRS
from time_x_task X
join task T on T.ID=X.TASK_ID
join timelog L on L.ID=X.TIME_ID
join (select count(1) as CNT, TIME_ID
from time_x_task
group by TIME_ID) L2 on L2.TIME_ID=L.ID;[/code]

Now we SUM(ACTUAL_HRS) instead of sum(HOURS).

OK, this is a contrived example, but the point I’m trying to make is that sometimes you have to throw away the original question and think about what you’re really trying to achieve.

A light-weight SQL editor plugin for Eclipse IDE

Eclipse is not my primary IDE, but I do find it very useful for editing ANT project definitions, especially the outline view.

I use generic projects in Eclipse Juno 4.2 to manage collections of DDL scripts, but I did miss SQL syntax highlighting. Well, no more!

I found this plugin on the Internet: http://blog.debertshaeuser.com/index.php?page=eclipse-sql-editor.

Just download the jar and put it in the eclipse/plugins folder, then re-start Eclipse. It works great:

Long columns wrapping in the SQL Result Grid

Have you ever experienced a moment when you find that an application you’ve used for years has a feature that, if only you’d known about it, would have saved you time over and over again? It’s a mixed feeling… one one hand, it’s “Yay, this software is awesome!” and the other it’s “Doh! I’m an idiot!”.

Well, PL/SQL Developer just handed me another of those moments.

In the result grid of the SQL Query window, you can resize any row by grabbing the separator line in the margin and pulling down. I think I knew that… but what didn’t realize is that any column with a long string will then wrap into the extra space, like this:

I think that is pretty cool. And it’s been in the product since at least version 8 (which is what I’ve used for the last few years).

I put on my robe and prophet hat

Six months ago, I went on record* as saying:

In eighteen months, no-one will be talking about “the Cloud”.

I may have to re-qualify that statement:

If your business plan has the word “cloud” in it, you’re doing it wrong.

The point here is that “cloud computing” is not Web 3.0, or a revolution in IT architecture, but much more about economics and risk.

As usual, Bob Lewis says it better and is more deserving of my Prophet Hat.

* ok, maybe it isn’t written down anywhere. But I said it out loud at least once.

Windows 7 can’t see SMB-based network attached storage without a hack

I’m sure it’s all in the name of greater security, but I don’t really care about that. My new Windows 7 studio computer could not back up my projects to my Network Attached Storage (NAS) box. Not a problem in XP, but in Windows 7, no matter what credentials you try, you get an error message “invalid username or password”.

Not very helpful, and in fact, misleading. Here’s what you need to do to enable connectivity:

Windows Control Panel > Administrative Tools > Local Security Policy

Then, in the hierarchical list of options on the left, select Local Policies > Security Options.

On the right side, scroll down the list and configure the following settings:

Network security: LAN Manager authentication level = Send LM & NTLM responses
Network security: Minimum session security for NTLM SSP based (including secure RPC) clients = No minimum (clear both checkboxes)

After this, I rebooted, although that was probably unnecessary.

I can now browse and copy files to the file shares on the NAS box. Hooray! 

(solution obtained from http://www.tomshardware.com/forum/75-63-windows-samba-issue)

Windows 7 : My Idea? Clearly not *grits teeth*

 

“I’m a PC and Windows 7 was totally my idea” – Microsoft Marketing Campaign, 2009

 

My Windows 7 out-of-box experience was less that satisfactory.

Actually, it sucked.

Once I got past the standard “Running Windows for the first time” wizard  – you know, entering a username, selecting the network type, etc etc – I was presented with a nice clean Windows desktop. Ahhh…

And then all hell broke loose.

First, Windows Update downloaded and installed some patches and prompted me for a reboot. Already? IT’S BEEN 15 SECONDS, PEOPLE! Give me a break.

At the same time, some Dell Backup software pops up and says I really should create recovery disks, and I really should do that immediately.

At the same time, some McAfee Anti Virus software pops up and says I should register immediately and get updated virus signatures, immediately, before I do anything else.

There’s more: Each of these frelling dialogs had at least three different possibilities for clicking on. Close boxes, Next buttons, hyperlinked text everywhere… What’s a regular person to do?

OK, so I chose Windows Update, and pressed Reboot Now. Seemed like the safe choice. Perhaps I was wrong. I don’t know.

Hey guess what – when do you that, and you log in after the reboot – those other dialog wizardly things don’t show up. It’s just like policemen, when you want them, you can’t find them.

So how do I create my recovery disks now? It turns out that you have to drill down into the Start menu – which no longer shows you a nice expanded menu of all the sh*t installed on your computer, by the way – and find the Dell Backup software, and poke around until you find the right combination of options that allows you to create the recovery disks.

It’s a serious deal, creating recovery disks. You have to put a DVD blank in the drive and then press GO and then don’t do anything until it’s finished, or the process might not be successul. So I set it going.

In the middle of this process, McAfee pops up again. I Can Has Registration? Pleeez?

Then Windows Update is back. You’ve Got Updates! GIVE ME A FRELLING BREAK.

*deep breath*.

After 20 minutes or so, I had my recovery disks, I’d uninstalled McAfee (DO NOT WANT, THANK YOU VERY MUCH) and set about configuring Windows Update to “Annoy only”. Well this brings me to the Microsoft Windows User Interface Designers:

Guys: Who decided that the Windows Configuration would be better served by a ‘Choose Your Own Adventure’ novel running in a Web Browser? YOU ARE RETARDED. YOU ARE SO RETARDED THAT YOU ARE NOT EVEN HUMAN.

Apparently your memory is so limited that you actually like going around and around in circles and getting nowhere because you don’t realize you are stuck in a loop. That’s the only explanation.

And don’t get me started about The Ribbon(TM). I noticed it in WordPad and was momentarily taken aback because I knew I hadn’t installed Office 2007. Alas, the RETARDS working at Microsoft managed to get their excuse for ignoring the last decade of computer user interaction standards wedged into Windows 7 into at least two of the built-in applications. Thank goodness I never actually use WordPad or Paint.

Enough ranting.

The Good: The stuff that is invisible.

I’m not talking about the frelling transparent windows when I say invisible. The transparent UI crap can all be disabled, thank goodness, and the Windows Classic theme is at least bearable. No, by ‘invisible’ I mean the multi-processing, multi-threading goodness that permeates the OS:  Inserting a disk into the optical drive no longer freezes Windows Explorer. Things like that. When it works, it works better than Windows XP. Most of the stuff under the hood is an improvement. Networking, hardware support, etc etc.

If only the UI people hadn’t messed around with what wasn’t broken. It’d actually be a pleasure to use.

“Because it’s there” may not be a good reason

I'm seriously thinking about going back to Windows XP on my little laptop here. For two reasons:

1. Ubuntu 9.04 / Gnome appears to run sluggishly. Is it the EXT4 file system? Is it some optimization I haven't discovered? I have no idea. Since part of the reason of giving Linux a go was to experience a lean, mean OS with no cruft, it seems just wrong to me. 

2. I'm not really having much fun. I thought it would be fun to dabble and learn and maybe even get MonoDevelop and ASP.NET working. Well it's not fun at all. It's tedious. There is much that I don't understand and what little time I have left over for this kind of activity is spent trying to find exactly the right blog post that explains something.

I want my comfort zone back.

Ubuntu Diary – Thunderbird 2 + Lightning 0.9

OK, so I installed the Lightning 0.9 add-on to Thunderbird in order to have Calender functionality, and I didn't notice until yesterday that I can't actually add any events to the calendar! The buttons and menu options are disabled.

This evening some quick googling turned up this reference to Bug# 278853 which is actually misleading because this isn't a bug in Thunderbird at all, but a dependency of Lightning 0.8 and 0.9 on a library that isn't installed on Ubuntu 9.4 by default.

Solution:

  • Uninstall Lightning 0.9 from Thunderbird's add-on dialog;
  • Go to Synaptic Package manager and search for "libstdc" and select to add the libstdc++5 package and apply changes;
  • Re-install Lightning 0.9

Result: I can now add an event to my calender reminding me to record the show "Being Human" on BBC America on the evening of July 25th.

« Older posts Newer posts »

© 2024 More Than Four

Theme by Anders NorenUp ↑