10 Commits

Author SHA1 Message Date
Matt
a54471849b sitemap.xml support for harvesting loc urls.
parse xml docs as pure xml again but set nodeid
to TAG_LINK etc. so Linkdb.cpp can get links again.
added isparentsitemap url filter to prioritize urls
from sitemaps. added isrssext to url filters to
prioritize new possible rss feed urls. added numinlinks
to url filters to prioritize popular urls for spidering.
use those filters in default web filter set.
fix filters that delete urls from the index using
the 'DELETE' priority. they weren't getting deleted.
2015-03-17 14:26:16 -06:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
b13f3d24d7 replaced unsigned long long with uint64_t 2014-10-30 13:30:39 -06:00
mwells
15756ec94a Merge branch 'diffbot-testing' into testing 2014-07-14 18:10:13 -07:00
mwells
a72c5dae51 fix <script> tags that immediately end in </script> or
never end but hit another <script> or a </gbiframe> tag.
2014-07-14 17:24:20 -07:00
mwells
e22641997a fix geth1tag some more.
fixed bad comment tag detection. was losing
a good deal of some pages because of that.
2014-07-07 08:20:21 -07:00
Matt Wells
261f4feb9b fixed cdata parsing issue 2013-12-19 16:04:53 -08:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00