Over the past year I’ve been impressed with how much the Indieweb.org Wiki has improved. Heck, it was good when I first saw it, but members are very active and relentlessly keep editing, improving, adding on and tweaking, so it just gets better.
Somebody in one of the Indieweb chat channels recently suggested that the Indieweb wiki would make a good seed site (aka starter crawl) if one were starting to build a new search engine index. I’ve been thinking about this, off and on, for a couple of weeks and I have to say I agree: the Indieweb.org wiki would make a good seed site for a web search engine.
What’s a “seed site”?
Briefly, a seed site, or starter crawl, is a site (or one of several) that a search engine crawler visits first to find a wide variety of worthwhile pages to index. The crawler indexes the URLs it finds there, then indexes more pages on those sites, in the process discovering the URLs they link to, and so on. In the old days the Yahoo directory and Dmoz directory were considered prime seed sites for search engines. Later Wikipedia came along, and it is still considered an important seed site for its outbound links. These sites were considered prime starters in part because the outbound links had all been reviewed by human editors, so a certain level of quality could be presumed.
If you want to learn more about seed sites, I suggest reading Bill Slawski’s article, Seed Sites for Search Engine Web Crawls. The comments are worth reading too.
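To make the idea concrete, here is a minimal sketch of a seed crawl as a breadth-first walk over links. It is only an illustration, not how any particular search engine does it. The function name `crawl_from_seeds`, the `fetch_links` callback and the toy link graph are all made up for this example; a real crawler would fetch pages over HTTP, respect robots.txt and rate limits, and so on.

```python
from collections import deque


def crawl_from_seeds(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit the seeds first, then the pages they link to.

    fetch_links is a caller-supplied function (hypothetical here) that takes
    a URL and returns the list of URLs that page links out to.
    """
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)                    # every URL discovered so far
    indexed = []                         # pages in the order they were crawled

    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        indexed.append(url)
        for link in fetch_links(url):
            if link not in seen:         # only queue URLs we have not met yet
                seen.add(link)
                queue.append(link)
    return indexed


# Toy link graph standing in for the real web (all URLs are invented).
TOY_WEB = {
    "https://seed.example/": ["https://blog-a.example/", "https://blog-b.example/"],
    "https://blog-a.example/": ["https://blog-c.example/", "https://seed.example/"],
    "https://blog-b.example/": ["https://blog-c.example/"],
    "https://blog-c.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for fetching a page and extracting its outbound links."""
    return TOY_WEB.get(url, [])


pages = crawl_from_seeds(["https://seed.example/"], toy_fetch_links)
print(pages)
```

Starting from the single seed, the crawl reaches all four toy pages, and the `max_pages` cap is what keeps a crawl like this from running forever on the real web.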
Why Use the Indieweb.org Wiki?
First, there is a wealth of good information on the wiki pages alone, even without crawling the outbound links. Second, the outbound links themselves are worth crawling.
Now, I would not use the Indieweb wiki as my only seed unless I were, somehow, creating an Indie Web only search engine, but I think it would be good as part of a mix of other seed sources. The Web has changed in the last 20 years: commercial sites have taken over, and the wiki’s outbound links lend a certain, needed counterbalance to the commercial.
- The outbound URLs are human curated. This minimizes low quality content.
- It links to some quality content that might take a while to find by other means, namely a lot of quality blogs which also link out freely.
- Wiki pages act as tags. This isn’t quite as useful for a search engine as a full directory taxonomy of categories, but it is useful.
- The wiki is not outrageously huge. Make no mistake, it’s big and growing bigger, but it’s dwarfed by something like Wikipedia and easier to digest within the bandwidth limits of a startup.
- It’s constantly being updated. This makes it a good source for re-crawls because new links are constantly being added.
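The re-crawl point can be sketched in a few lines: keep the outbound links from the last crawl, and on the next visit pick out only the ones you have not seen before. This is a rough illustration under my own assumptions; the function name `new_links_since` and the example URLs are invented, and a real crawler would also normalize URLs before comparing them.

```python
def new_links_since(previous_links, current_links):
    """Return links found in the current crawl but not in the previous one.

    Order follows the current crawl, and repeated links are reported once.
    """
    known = set(previous_links)
    fresh = []
    for link in current_links:
        if link not in known:
            known.add(link)      # remember it so duplicates are skipped
            fresh.append(link)
    return fresh


# Outbound links from two visits to the same seed page (invented URLs).
last_crawl = ["https://blog-a.example/", "https://blog-b.example/"]
this_crawl = [
    "https://blog-a.example/",
    "https://blog-d.example/",
    "https://blog-b.example/",
    "https://blog-e.example/",
    "https://blog-d.example/",
]

fresh = new_links_since(last_crawl, this_crawl)
print(fresh)
```

Only the fresh links need to go back into the crawl queue, which is why a frequently updated seed pays off on every re-crawl.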
Other Good Seed Sites:
- Curlie.org – the successor to the old Dmoz (Open Directory Project). Volunteer editors have been working on cleaning out dead links for a couple of years, and possibly adding new listings, so it’s not quite as dead as one might think. For somebody starting a web search engine, it’s hard to ignore 3 million or more listings. Said listings may be older sites, but I’d gamble that the quality is better than links from Twitter or Facebook would be. Plus there is that taxonomy. I would not spend time re-crawling for new URLs after the initial starter crawl.
- Wikipedia – they don’t link out quite as freely as they once did, but it is much more up to date.
- Reddit – or at least large parts of Reddit. It’s big and diverse. Sub-Reddits act as tags. It is constantly expanding with new links, which helps you determine what is new and popular. This is a good place to start a crawl and to re-crawl for new links. Reddit was suggested to me by some very experienced SEOs when we were discussing this topic. I trust their judgment.
- Indie Map – Maybe. I’d include it as a starter but would skip re-crawling.
- Hacker News – Maybe for a seed crawl. I would try to tap HN, including the comments, for fresh new links.
- Pinboard – constantly updating bookmarks. This would make a good seed site.
Agree or disagree on Indieweb.org as a seed site? Can you think of other seed sites I’ve missed? If so, leave a comment below. Thanks.
@bradenslen I had no idea what a seed site was, but this makes sense. Interesting!
@bradenslen Thanks for this, Brad. I continue to learn about such a fundamental topic mostly by way of your posts. 👨‍🎓