Clustering Your Data


What clustering means: This program looks at two or more experiments and finds the spots (ORFs) whose results are highly correlated across them. In other words, it looks for sets of genes that are bright red in every experiment, sets that are bright green in every experiment, sets that are bright red in one experiment and bright green in another, and so on. For example, in our experiment we used Red to indicate the experimental conditions and Green to indicate the control. But the experiment we clustered with did just the opposite: they used Green for experimental and Red for control. Therefore, if I wanted to find the genes that were overexpressed under experimental conditions, I would need to find the spots where my slide was bright Red and the other slide was bright Green. What the clustering program does is put all of these color sets next to each other, so that by clicking on a node I can view at once all the genes that are bright Red on my slide and bright Green on the other slide. If all of your experiments used Green for the experimental conditions, you would want to find all the spots where green was bright in every experiment, and the clustering program would put all of these genes next to each other for easy viewing.
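To make the Red/Green bookkeeping concrete, here is a minimal Python sketch. It is only an illustration and not part of Cluster or SMD: the gene names, the log2(R/G) values, the `orient` helper, and the cutoff of 1.0 are all made up. It re-signs each slide so that a positive number always means "brighter in the experimental channel," then lists the genes that are up on both slides.

```python
def orient(log_ratios, experimental_channel):
    """Re-sign log2(R/G) values so positive always means 'up in experimental'.

    experimental_channel is "red" or "green" (which dye carried the
    experimental sample on that slide).
    """
    sign = 1.0 if experimental_channel == "red" else -1.0
    return {gene: sign * value for gene, value in log_ratios.items()}


# Hypothetical log2(R/G) values; positive = red brighter, negative = green brighter.
my_slide = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 1.8, "GENE_D": 0.1}
other_slide = {"GENE_A": -1.7, "GENE_B": 1.2, "GENE_C": 0.2, "GENE_D": -0.1}

mine = orient(my_slide, "red")         # my slide: Red = experimental
theirs = orient(other_slide, "green")  # other slide: Green = experimental

threshold = 1.0  # arbitrary cutoff for "bright", chosen for this example only
up_in_both = sorted(
    gene for gene in mine
    if mine[gene] >= threshold and theirs.get(gene, 0.0) >= threshold
)
print(up_in_both)
```

Here GENE_A is bright Red on my slide and bright Green on the other slide, so it is the kind of gene the paragraph above is looking for.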

To Analyze data on SMD: (http://genome-www4.stanford.edu/MicroArray/SMD/)

The first thing to do is download Michael Eisen's data analysis software (http://rana.lbl.gov/EisenSoftware.htm). There is a link to his page on the SMD web page under software. You will have to register on his site, and this process is free for anyone using the software for non-profit purposes. The two programs you need to get are Cluster and Treeview, for Windows.

GETTING YOUR DATA:

Your data should be posted on the SMD home page, and you will need to find out where it is located. The easiest way to do this is to go to the public search button.

If you want to find your experiment alone:

Go to search by List Data and pick some parameter you can recognize. For example, I know my slide number is yO1n010, so I can go to List by Experiment ID and change the "sorted by header" category to "SlideName."

From there I can peruse the data and find my experiment.

Once I locate my slide I click on the Experiment ID link and it will bring me to a page where I set up my data retrieval parameters. Because my experiment used the Red color for the experimental conditions, I chose to sort by R/G Normalized Mean, descending. This means I will get a list of genes which have the brightest red color (i.e., the experimental conditions caused these genes to be expressed over the control) and the list will show this data from the brightest (most expressed) to the least bright (least expressed).
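If you download the data and want to reproduce the same ordering yourself, the idea looks roughly like this in Python. The field name `rg_norm_mean` and the values are made up for illustration; they are not SMD column names.

```python
# Hypothetical spots with a normalized Red/Green ratio for each one.
spots = [
    {"name": "GENE_A", "rg_norm_mean": 0.8},
    {"name": "GENE_B", "rg_norm_mean": 3.2},
    {"name": "GENE_C", "rg_norm_mean": 1.5},
]

# Descending sort: the reddest (most induced, if Red = experimental) spots come first.
ranked = sorted(spots, key=lambda spot: spot["rg_norm_mean"], reverse=True)

for spot in ranked:
    print(spot["name"], spot["rg_norm_mean"])
```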

Under display, you can pick those elements you want to find in your spreadsheet, like gene name, process, and function. You can highlight those categories which are not adjacent by holding down the control key. You can also filter your data if you want, and adjust it to, say, only show spots that were flagged, or to only show numbers if they are above a certain value.

From here you click Submit and a web page will come up with your sorted data. This page can also be downloaded if you click on "make downloadable file of all data" before hitting submit, and can be saved as an Excel file.

If you want to cluster your data with other experiments:

Go back to public search and click on Basic Search. Pick a results type to search in, for example, Experiment.

From here you can find your organism (in my case, S. cerevisiae) and find your group where your data has been posted (in my case, GCAT).

To cluster, you will need to hit "Data Retrieval and Analysis" NOT "Display data." This will come up with the individual names of experiments, and you can choose which experiments you wish to cluster together (in my case, I would highlight the two Swarthmore College experiments).

Once you have selected the experiments you wish to cluster, you hit the Data Retrieval and Analysis button again.

This will bring you to a page where you set your data retrieval parameters, which you should fill out as described under Data Retrieval Parameters near the end of this page.

From there, submit your query. It may take some time to load.

When it's done, click on "Download Preclustering File." This will take you to your data page, and you will need to save this page as a text file.


CLUSTERING YOUR DATA:

Open the cluster program.

Click on Load File and find the text document you just saved. The program will tell you that the file can't be opened; ignore this message.

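Under filter genes, click on the % Present box, and decide what percentage you want to use. The higher the percentage, the more genes get cut out: a gene is kept only if at least that percentage of your experiments have data for it (spots that were too dim to pass the intensity filter during data retrieval count as missing). 80% is a good place to start.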

Then click on the Filter button. The program will then tell you how many genes passed through your filter. In my case, I used 80% as my filter, and it showed about 1200 out of 6000 genes. Once you have decided on the number of genes you want to look at, hit the accept button. If you don't hit the accept button, it will try to do all your data and will take much more time.
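Here is a small Python sketch of what the % Present filter is doing, following the explanation in the advice section later on this page (a gene is kept when enough of the experiments have data for it). This is my own illustration, not Cluster's code; the function names and the sample values are hypothetical, and `None` stands for a missing value.

```python
def percent_present(values):
    """Percentage of experiments that have a value (not None) for one gene."""
    present = sum(1 for value in values if value is not None)
    return 100.0 * present / len(values)


def filter_genes(rows, min_percent=80.0):
    """Keep only the genes whose % present is at least min_percent."""
    return {
        gene: values
        for gene, values in rows.items()
        if percent_present(values) >= min_percent
    }


# Hypothetical log ratios for three genes across five experiments.
rows = {
    "GENE_A": [0.5, 1.2, -0.3, 0.8, 2.0],     # 5 of 5 present = 100%
    "GENE_B": [0.1, None, 0.4, -1.1, 0.9],    # 4 of 5 present = 80%
    "GENE_C": [None, None, 1.5, 0.2, -0.7],   # 3 of 5 present = 60%
}

kept = filter_genes(rows, min_percent=80.0)
print(f"{len(kept)} of {len(rows)} genes passed the filter")
```

Lowering `min_percent` lets more genes through, which is the same trade-off you make when you lower the % Present value and hit Filter again.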

Next click on the Hierarchical Clustering tab. Here you can decide whether to cluster your arrays as well as your genes. The Cluster box under genes should be checked, but NOT the Calculate weights box. If you want to cluster your arrays (I didn't), make sure the Cluster box under Array is checked; otherwise make sure it is unchecked. Again, don't check the Calculate weights box.

When you are all set hit the Average Linkage Clustering button and the program will cluster your data. This can take a long time. When it is done, it will create two files, a cdt file (this is for Treeview) and a gtr file.
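If you are curious what average linkage clustering means, here is a textbook-style sketch in plain Python. It is NOT the Cluster program's own code and I don't claim it reproduces Cluster's exact calculation; the distance measure (1 minus the Pearson correlation), the gene names, and the values are assumptions made for illustration. At each step it joins the two groups whose members are, on average, most similar, which is also why large data sets take so long.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (assumes non-constant data)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_linkage(profiles):
    """Repeatedly merge the two clusters with the smallest average pairwise distance.

    profiles maps a gene name to its list of values.
    Returns the merges in order as (members_a, members_b, distance).
    """
    def distance(a, b):
        return 1.0 - pearson(profiles[a], profiles[b])

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean distance over every cross-cluster pair.
            d = sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical log ratios for four genes across four experiments.
profiles = {
    "GENE_A": [2.0, 1.0, -1.0, -2.0],
    "GENE_B": [1.8, 1.1, -0.9, -2.1],
    "GENE_C": [-2.0, -1.0, 1.0, 2.0],
    "GENE_D": [-1.9, -0.8, 1.2, 1.7],
}

merges = average_linkage(profiles)
for members_a, members_b, d in merges:
    print(members_a, "+", members_b, f"distance={d:.3f}")
```

The order of the merges is the tree you later click through in Treeview: genes joined early (small distance) sit next to each other.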

VIEWING YOUR DATA:

Open the Treeview program. Under File, select Load and find your cdt file.

When it opens, you will be presented with a picture view of your clustered data on the left. If you click on the lines in the pictures, all genes past that node will appear on the right side of the screen with the gene name, function, and process, as well as a close-up view of your experiments side by side. In addition, you can drag over genes with your mouse and they will appear on the right, or you can click on individual spots (genes) and look at their information.

From here, you just peruse your data and see what you find!


Advice from Barbara Dunn at SMD (Stanford Microarray Database)

To cluster your expt with other expts, go to basic search, click "experiments" and then select "Saccharomyces cerevisiae" and "GCAT". Then you should press the button labelled "Data retrieval and analysis" (Not "Display Data"). Then you will get a list of all the GCAT expts, and you can highlight (using the "Ctrl" key on your computer) the specific expts that you want to cluster with each other. Then you press the "Data retrieval and analysis" button, and you should get the form that I described in my previous e-mail. This will give you a downloadable file that is ready to open up in the "Cluster" program.

Clustering Directions

Just getting back to you on this question...sorry for the delay. Anyway, I don't cluster from SMD (I don't know if that is what you tried to do?). I use Mike Eisen's Cluster program. It takes quite a bit of computer power, so I hope that it will work for you (if not, let me know and I will ask the SMD curators to help with clustering through SMD). Here's what I do (hope it helps!):

1. Retrieve the data from the experiments that you want to cluster. First you have to fill in the SMD Data Retrieval and Analysis form using the Data Retrieval parameters that I describe below.

2. At the bottom of the SMD data retrieval form, select the "Download raw data" option (or something to that effect...the database is down so I can't look at the page!). You will get a web page of the data text file, and using the browser, you just save the data onto your PC as a text file.

3. The data file will be called a .pcl (pre-clustering) file. You can now use this file directly to cluster, using Mike Eisen's Cluster program.

4. Start up Cluster, and load the .pcl file (you may get a warning after it loads, but ignore it).

5. Filter the data: Under the tab Filter Data, you should check the box that says "% present = 80". What this does is to only cluster genes for which 80% (or more) of the experiments have data for that gene. You've already filtered out bad spots when you filtered for Intensity/Background under Data Retrieval parameters, so this filter just means you won't get a lot of grey boxes (meaning "no data") across your cluster.

6. Hit the button that says "Filter". This will report to you how many of the genes made it through the filter. If you want to include more genes, you can lower the % Present value to 75% or 50% or whatever, and just keep hitting the "Filter" button. When you like the cutoff results, you can then hit the "Accept" button.

7. Go to the Hierarchical Clustering tab. You will probably want to unclick the Arrays cluster option (when it is checked, it puts all the experiments that are similar to each other into hierarchical clusters), but if you've got the computer power (and/or a small amount of data to cluster) it's kind of fun to do your clustering either with genes alone, arrays alone, or in both directions. Don't click on the Calculate Weights.

8. Hit the "Average Linkage Clustering" button. It will probably take a LOOONG time to cluster if you've got a pretty big data set.

9. The program will produce 2 to 3 files when it's done: a .cdt file (the bulk of the data), plus an .atr file (the array nodes, if you clustered arrays) and/or a .gtr file (the gene nodes, if you clustered genes).

10. You can then open up your .cdt file in TreeView and cruise through the cluster!
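If you want to peek inside the .pcl file from step 3 before loading it into Cluster, a short Python sketch like the one below can read it. The layout I assume here (tab-delimited text; a header row with an ID column, a NAME column, a GWEIGHT column, then one column per experiment; an EWEIGHT row; blank cells for missing data) is an assumption, so check it against your own file. The `read_pcl` function and the sample IDs and values are made up for illustration.

```python
import csv
import io


def read_pcl(handle):
    """Read a tab-delimited pre-clustering file (assumed layout, see note above).

    Returns (experiment_names, rows), where rows maps each gene ID to a list
    of floats, with None wherever the cell was blank.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[3:]  # skip the ID, NAME, and GWEIGHT columns
    rows = {}
    for fields in reader:
        if not fields or fields[0] == "EWEIGHT":
            continue  # skip blank lines and the experiment-weight row
        values = [float(cell) if cell.strip() else None for cell in fields[3:]]
        # Pad short rows so every gene has one entry per experiment.
        values += [None] * (len(experiments) - len(values))
        rows[fields[0]] = values
    return experiments, rows


# A tiny made-up file, held in memory so the example runs on its own.
sample = (
    "YORF\tNAME\tGWEIGHT\tslide_1\tslide_2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "ORF_1\tGENE_A\t1\t0.52\t-0.31\n"
    "ORF_2\tGENE_B\t1\t\t1.20\n"
)

experiments, rows = read_pcl(io.StringIO(sample))
print(experiments)
print(rows)
```

For a real file you would pass `open("yourfile.pcl", newline="")` instead of the in-memory sample.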

Data Retrieval Parameters

For Gene Selection Options, I just use the defaults as they come up. For Gene Filtering I unclick the box that says "Use one of:" (in other words, I do not do any gene filtering during data retrieval). For Biological Data to Select, I leave the default as choosing from Oracle, and I highlight "Gene Name", "Process" and "Function" in the scroll-menu. For Data Selection Options, I only change the Data Filtering fields-- I keep the first option checked as "active" and I change it from Regression Correlation to CH2IN/CH2BN and set it "gt" "1.2". Then I choose the second option checked as "active" and set it the same as the first, except use CH1I/CH1B. What this Data Filtering does is to only retrieve data for spots where the spot intensity is at least 1.2 times the background intensity. This is a pretty non-stringent threshold; sometimes I go to 1.3 or 1.5, but that will probably cut off too many of your spots. You can play around with any of the parameters, or even collect the data without any data filtering.
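To show what the Data Filtering step amounts to, here is a minimal Python sketch of an intensity-over-background check on both channels. The field names (`ch1_intensity`, `ch1_background`, and so on) and the numbers are hypothetical, not SMD's actual column names; only the idea (keep a spot when intensity divided by background is greater than 1.2 in each channel) comes from the description above.

```python
def passes_filter(spot, min_ratio=1.2):
    """True if intensity/background is greater than min_ratio in both channels."""
    for channel in ("ch1", "ch2"):
        background = spot[channel + "_background"]
        if background <= 0:
            return False  # cannot form a ratio; treat the spot as failing
        if spot[channel + "_intensity"] / background <= min_ratio:
            return False
    return True


# Hypothetical spots with raw intensity and background for each channel.
spots = [
    {"name": "SPOT_1", "ch1_intensity": 1600, "ch1_background": 1000,
     "ch2_intensity": 2600, "ch2_background": 1000},
    {"name": "SPOT_2", "ch1_intensity": 1100, "ch1_background": 1000,
     "ch2_intensity": 3000, "ch2_background": 1000},
    {"name": "SPOT_3", "ch1_intensity": 1300, "ch1_background": 1000,
     "ch2_intensity": 1250, "ch2_background": 1000},
]

kept = [spot["name"] for spot in spots if passes_filter(spot, min_ratio=1.2)]
stricter = [spot["name"] for spot in spots if passes_filter(spot, min_ratio=1.5)]
print(kept)
print(stricter)
```

Raising `min_ratio` from 1.2 to 1.5 drops SPOT_3 in this example, which is the "cut off too many of your spots" effect mentioned above.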

**NOTE--all of the parameters that I use for Data Retrieval and Clustering are not set in stone; feel free to play with them! However, I haven't played around with them that much, so I'm not real familiar with them and probably can't advise you intelligently.




© Copyright 2001 Department of Biology, Davidson College, Davidson, NC 28036
Send comments, questions, and suggestions to: macampbell@davidson.edu