All Apache Nutch distributions are released under the Apache License, version 2.0. The link in the Mirrors column below should display a list of available mirrors with a default selection based on your inferred location. If you do not see that page, try a different browser. The checksum and signature are links to the originals on the main distribution server.
It is essential that you verify the integrity of the downloaded files using the PGP or SHA signatures. A Unix program called shasum, included in many Unix distributions, can verify the SHA signature on the files. Older releases used an MD5 signature, which you can verify with the Unix program md5 or md5sum. If something doesn't work for you, try searching the archives of, or sending a message to, the Nutch or Hadoop users mailing list.
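As a sketch of the checksum workflow (the file below is a stand-in; in practice you download the tarball from a mirror and the `.sha512` and `.asc` files from the main Apache distribution server):

```shell
# Stand-in artifact so the commands below can be demonstrated end to end;
# substitute the actual release tarball and its downloaded checksum file.
cd /tmp
echo "pretend this is the release tarball" > nutch-release.tar.gz
sha512sum nutch-release.tar.gz > nutch-release.tar.gz.sha512  # normally downloaded, not generated
sha512sum -c nutch-release.tar.gz.sha512   # prints "nutch-release.tar.gz: OK" on a match
# PGP verification needs the project's KEYS file:
#   gpg --import KEYS
#   gpg --verify nutch-release.tar.gz.asc nutch-release.tar.gz
```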
Suggestions or tips are welcome; why not add them to the end of this Wiki page? First let me lay out the computers that we used in our setup. To set up Nutch and Hadoop we had 7 commodity computers with modest processor speeds. I am telling you this to let you know that you don't have to have big hardware to get up and running with Nutch and Hadoop.
Our computers were named with a devcluster prefix, and our master node was one of them. By master node I mean that it ran the Hadoop services that coordinated with the slave nodes (all of the other computers), and it was the machine on which we performed our crawl and deployed our search website. Both Nutch and Hadoop are downloadable from the Apache website.
The necessary Hadoop files are bundled with Nutch, so unless you are going to be developing Hadoop you only need to download Nutch. We built Nutch from source after downloading it from its subversion repository. Nightly builds of Nutch are also available; you can get a packaged tarball or check the source out of subversion.
Knowing how to use tar or subversion is outside the scope of this tutorial. Once you have a subversion client you can either browse the Nutch subversion webpage or check out the source directly. I am not going to go into how to install java or ant; if you are working with this level of software you should know how to do that, and there are plenty of tutorials on building software with ant.
It is worth noting that previous versions of Nutch came already built, but nowadays the release is just source code, so it has to be built before use. Once you have Nutch downloaded and unpacked, look inside at the folders and files it contains.
Add a build.properties file that tells the build where to put its output. This step is actually optional, as Nutch will by default create a build directory inside the directory where you unzipped it, but I prefer building to an external directory.
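On a linux machine the properties file might look like the sketch below; the `dist.dir` property name is an assumption based on Ant conventions of that era, so check which property your build.xml actually reads:

```properties
# build.properties -- placed next to build.xml in the unpacked Nutch source.
# dist.dir (assumed name) points the build at an external output directory.
dist.dir=/home/nutch/build
```

With that in place, running `ant` from the source root (with java and ant on your PATH) writes the build output into the named directory.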
You can name the build directory anything you want, but I recommend building into a new, empty folder; remember to create it if it doesn't already exist. Running the build should build nutch into your build folder, and when it is finished you are ready to move on to deploying and configuring nutch. Once we get nutch deployed to all six machines, we are going to call a script called start-all.sh to start the services on the cluster.
This means that the script is going to start the hadoop daemons on the master node and then ssh into all of the slave nodes and start daemons on them. The start-all.sh script is also going to expect that Hadoop is storing its data at the exact same filepath on every machine. The way we did it was to create the following directory structure on every machine. The search directory is where Nutch is installed. The filesystem directory is the root of the hadoop filesystem.
The home directory is the nutch user's home directory. On our master node we also installed a Tomcat 5 server. I am not going to go into detail about how to install Tomcat, as again there are plenty of tutorials on how to do that.
I will say that we removed all of the wars from the webapps directory and created a folder called ROOT under webapps, into which we unzipped the Nutch war file. This makes it easy to edit configuration files inside of the Nutch war. So log into the master node and all of the slave nodes as root.
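The webapps rearrangement described above might look like the following sketch; CATALINA_HOME and the war path are illustrative stand-ins, not the tutorial's actual values:

```shell
# Example Tomcat location; point this at your real installation.
CATALINA_HOME=${CATALINA_HOME:-/tmp/tomcat}
mkdir -p "$CATALINA_HOME/webapps"
cd "$CATALINA_HOME/webapps" || exit 1
rm -rf ./*.war ROOT        # clear the stock webapps
mkdir -p ROOT
# Unzip the Nutch war into ROOT so its internal config files are easy to edit:
#   (cd ROOT && jar -xf /path/to/nutch.war)
```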
Create the nutch user and the directory structure on each machine. Again, if you don't have root level access, you will still need the same user on each machine, because the start-all.sh script logs into every node as that user. It doesn't have to be a user named nutch, although that is what we use.
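The original page's exact commands and paths were not preserved; a sketch following the layout described above (one root containing search, filesystem, and home, identical on every node) might be:

```shell
# Run on the master and every slave. NUTCH_ROOT is an example; in a real
# deployment you would likely use something like /nutch, and the important
# thing is that the path is identical on every machine.
NUTCH_ROOT=${NUTCH_ROOT:-/tmp/nutch}
# useradd -m nutch                      # create the common user (requires root)
mkdir -p "$NUTCH_ROOT/search"           # where Nutch is installed
mkdir -p "$NUTCH_ROOT/filesystem"       # root of the Hadoop data
mkdir -p "$NUTCH_ROOT/home"             # the nutch user's home directory
# chown -R nutch:nutch "$NUTCH_ROOT"    # hand everything to the nutch user
```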
Also, you could put the filesystem under the common user's home directory. Basically, you don't have to be root, but it helps. For this to work we are going to have to set up ssh keys on each of the nodes so the script can log in without a password. Since the master node is going to start daemons on itself, we also need the ability to use a password-less login on itself.
You might have seen some old tutorials or information floating around the user lists saying you would need to edit the SSH daemon to allow the PermitUserEnvironment property and to set up local environment variables for the ssh logins through an environment file. This has changed: we no longer need to edit the ssh daemon, and we can set the environment variables inside the hadoop-env.sh file instead. Open the hadoop-env.sh file in an editor; there is a set of environment variables in it that needs to be changed.
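A template for those variables might look like the following fragment; all paths are examples and must match your actual deployment on every node:

```shell
# Fragment of conf/hadoop-env.sh -- adjust every path to your installation.
export HADOOP_HOME=/nutch/search                 # where Nutch/Hadoop is deployed
export JAVA_HOME=/usr/java/jdk                   # your JDK location (example)
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs        # where the daemons write logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves  # the list of slave hosts
```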
There are other variables in this file which will affect the behavior of Hadoop; there is a section below on those. Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the nutch user we created earlier. Don't just su to the nutch user; start up a new shell and log in as the nutch user. If you su, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the nutch user.
You only have to run ssh-keygen on the master node. On each of the slave nodes, after the filesystem is created, you will just need to copy the keys over using scp. You will have to enter the password for the nutch user the first time. An ssh prompt will appear the first time you log in to each computer, asking if you want to add the computer to the known hosts; answer yes. Once the key is copied, you shouldn't have to enter a password when logging in as the nutch user.
Test it by logging in to each of the slave nodes that you just copied the keys to. Once we have the ssh keys created, we are ready to start deploying nutch to all of the slave nodes. Note: this is a rather simple example of how to set up ssh without requiring a passphrase; there are other documents available which can help if you have problems. It is important to test that the nutch user can ssh to all of the machines in your cluster, so don't skip this stage.
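A sketch of the key setup, run as the nutch user on the master node (the slave host name is an example):

```shell
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# Generate a key pair with an empty passphrase (skipped if one already exists):
test -f ~/.ssh/id_rsa || ssh-keygen -q -t rsa -P '' -f ~/.ssh/id_rsa
# The master starts daemons on itself too, so authorize its own key:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Copy the key to each slave (you type the nutch password this one time),
# then confirm the login no longer prompts for a password:
#   scp ~/.ssh/id_rsa.pub nutch@devcluster02:~/.ssh/authorized_keys
#   ssh nutch@devcluster02
```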
First we will deploy nutch to a single node, the master node, but operate it in distributed mode. This means that it will use the Hadoop filesystem instead of the local filesystem.
We will start with a single node to make sure that everything is up and running and will then move on to adding the other slave nodes. All of the following should be done from a session started as the nutch user. We are going to setup nutch on the master node and then when we are ready we will copy the entire installation to the slave nodes.
First, copy the files from the nutch build to the deploy directory. When we were first trying to set up nutch, we were getting bad interpreter and command not found errors because the scripts were in DOS format on linux and not executable.
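A sketch of the copy plus the fix for those errors follows; the paths are examples, and dos2unix can replace the tr pipeline if you have it installed:

```shell
# Copy the build output into the deploy directory (example paths):
#   cp -R /home/nutch/build/* /nutch/search/
# Demonstrate the line-ending fix on a stand-in DOS-format script:
f=/tmp/start-all.sh
printf '#!/bin/sh\r\necho started\r\n' > "$f"
tr -d '\r' < "$f" > "$f.unix" && mv "$f.unix" "$f"   # strip carriage returns
chmod 744 "$f"                                       # restore the execute bit
```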
Notice that we are fixing both the bin and config directories. In the config directory there is a file called hadoop-env.sh. Also, it is recommended to make a copy of the index for Tomcat, so that you can crawl and update your index independently.
Requirements: Java 1.x or later, Nutch 0.x, and Apache's Tomcat 5.x. On Win32, cygwin is needed for shell support. If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.