open-source-search-engine

Author	SHA1	Message	Date
Matt	a54471849b	sitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted.	2015-03-17 14:26:16 -06:00
Matt	931a1c4bc6	good checkpoint. quite a few fixes.	2014-11-17 18:13:36 -08:00
Matt	96b8197ad3	now it compiles with -m32	2014-11-10 14:45:11 -08:00
Matt Wells	e7dd8f7956	replace long long with int64_t	2014-10-30 13:36:39 -06:00
Matt Wells	b13f3d24d7	replaced unsigned long long with uint64_t	2014-10-30 13:30:39 -06:00
mwells	15756ec94a	Merge branch 'diffbot-testing' into testing	2014-07-14 18:10:13 -07:00
mwells	a72c5dae51	fix <script> tags that immediately end in </script> or never end but hit another <script> or a </gbiframe> tag.	2014-07-14 17:24:20 -07:00
mwells	e22641997a	fix geth1tag some more. fixed bad comment tag detection. was losing a good deal of some pages because of that.	2014-07-07 08:20:21 -07:00
Matt Wells	261f4feb9b	fixed cdata parsing issue	2013-12-19 16:04:53 -08:00
Matt Wells	f6e560c1f4	Initial file population.	2013-08-02 13:12:24 -07:00