121 Commits

Author SHA1 Message Date
Matt Wells
7bd2344f41 increase regular page download timeout from 30
seconds to 60 seconds to accommodate some slower websites.
2016-05-18 10:05:02 -07:00
Matt Wells
d7a6a0a1ff fix gap.com redirects that require us
to set multiple cookies in the spider request.
2016-02-09 13:38:59 -08:00
Matt Wells
9247e15210 make spider use HTTP/1.1 not 1.0 since
some sites have been found to return
406 Not Acceptable periodically because of it.
2016-02-05 10:14:09 -08:00
Zak Betz
fc3ba95226 Fix host selection for downloading when nospider directives are present.
It was always choosing the first host sequentially with spidering
enabled.  Now it looks at the other hosts in the shard selected.
2015-11-29 21:36:19 -07:00
Zak Betz
9ff387a898 More fixes to prevent spider traffic from hitting hosts with nospider
directive.
Bug fix for msg20 lookups always being directed away from noquery hosts.
2015-11-13 15:03:02 -07:00
Matt
37cc4f2ba8 Merge branch 'diffbot-testing' into testing 2015-11-09 11:13:42 -07:00
Matt Wells
776b94396e a new ban msg for http status 503 2015-10-22 13:23:02 -07:00
Matt
39214a9dc6 Merge branch 'diffbot-testing' into testing 2015-10-02 19:26:15 -06:00
Matt Wells
93943a0cab some pages legitimately have no outlinks, no need to think
they were banned.
2015-09-24 14:01:23 -07:00
Matt
100888d691 fix file/dir creation permissions bugs 2015-09-21 12:44:41 -06:00
Matt
74cde33a3a just use the user's umask val for all file/dir creation 2015-09-21 11:33:38 -06:00
Matt
ce7b06fc4d all files made are now group writable.
if you don't like that then you can make
a special group and set the directory just
group writable for that group using chmod g+s <dir>.
2015-09-21 11:19:34 -06:00
Matt
8299197cca comments in <script> tags are a
convolution. deal with all four types and
their precedence issues. all of this is
to find the proper end of the </script> and
not a </script> or <script> that is being
printed out in the javascript in the <script> tag.
2015-08-28 16:31:22 -06:00
Matt
1badb8cd07 fix up hammer queue table print out on
sockets pages. make crawl delay link to the
robots.txt.
2015-08-25 11:07:54 -06:00
Matt
ea2c2d7190 show read buf of http sockets as well as the send buf
in the tool tip.
2015-08-25 10:53:16 -06:00
Matt Wells
7fcc2ab4e1 in the sockets table page,
show url download requests that are queued up to prevent
hammering an ip. also show the first 500 bytes of the send buf
in the http server sockets table.
2015-08-25 09:34:45 -07:00
Matt
a821d8bc41 Merge branch 'ia' into ia-zak 2015-05-05 23:46:16 -07:00
Matt Wells
d6bb5e98f8 more ban detect fixes 2015-05-04 14:40:52 -07:00
Matt
86800a0656 if a root/seed url has no outlinks, assumed banned. 2015-05-04 14:23:28 -07:00
Matt
a3672701f6 Merge branch 'diffbot-testing' into ia 2015-05-03 12:08:25 -07:00
Matt Wells
31e54df0c8 fix core from ::read returning 0.
do not do ban checks when autobackoff and autoproxies are disabled.
2015-05-03 10:17:36 -07:00
Matt Wells
821c6cb424 fix core 2015-05-03 00:33:47 -07:00
Matt Wells
2421bf3d1d ia checkpoint 2015-05-02 23:51:19 +00:00
Matt Wells
6d39bb5df8 added hack to log controls to avoid sending
msg13 (download requests) to host  until we fix
the streaming bug better.
2015-05-02 10:32:13 -07:00
Matt Wells
dccc1667ec added logdebugmsg13 to find out why urls are getting stuck
on host 0 in msg13 handler.
turn crawldelay backoff logic off by default until we fix
case of mis-detection on captchas and maybe some other things.
fix core when loading twitchy table on startup.
2015-05-01 13:19:45 -07:00
Matt
16bf1cf063 make auto back off a parm. i could see you'd want to
disable that if the ban/throttle detection is wrong.
2015-04-30 17:32:51 -07:00
Matt
656d89d98d spider proxy simplifications. remove global parms and just make
coll specific for now.
2015-04-30 17:18:57 -07:00
Matt Wells
20912541af remove global spider proxy parms for simplicity 2015-04-30 17:06:35 -07:00
Matt Wells
3051310d16 fix core from bulk job ip ban detection in msg13 2015-04-30 16:59:02 -07:00
Matt
a15c9fd4c6 more fixes for auto proxies 2015-04-30 16:52:46 -07:00
Matt
7f1ac7460f fixes for auto backoff 2015-04-30 16:34:11 -07:00
Matt Wells
1825f6bd27 retry download if was in the twitchy table
at start of download, and not using proxies at all.
2015-04-30 16:06:13 -07:00
Matt
0970975a57 tested auto proxy use and auto spider (non-proxy) backoff to
3 second crawldelay successfully on the stamps site.
2015-04-30 15:31:09 -07:00
Matt
75c05ef9a9 twitchy updates 2015-04-30 14:18:23 -07:00
Matt
66db73f494 if we can't use any proxies and we detected a url as banned
then just use a crawldelay of 3 seconds.
2015-04-30 14:16:17 -07:00
Matt
6d8bb19962 checkpoint for auto proxy logic 2015-04-30 13:28:57 -07:00
Matt
bdff012152 checkpoint for auto-proxy logic. 2015-04-29 18:15:54 -07:00
Matt Wells
0656cc4c72 fix a core on seraph host 2015-04-22 15:46:35 -07:00
Matt
95e3a760e9 proxy fixes 2015-03-05 11:10:40 -08:00
Matt
db4fcb30f8 limit downloaded doc size to something
under the MAX_DGRAMS limit so msg13 won't core
trying to send the reply back.
2015-02-16 09:43:39 -07:00
Matt
6fc83566e2 more fixes 2015-02-02 14:06:38 -08:00
Matt
c15bd53e52 added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
2015-02-02 13:23:38 -08:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
c03ba31ec2 try to reduce log spam 2015-01-05 11:03:49 -08:00
Matt
6c5ca9162c quick fix for internal ip bug 2014-12-16 13:39:09 -08:00
Matt
329f004e74 compiler updates 2014-12-10 12:09:04 -08:00
Matt Wells
8e315504a2 fix empty rdbcache bug of not enough buf mem. 2014-11-27 13:17:00 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
4a0554c76f more 64bit fixes 2014-11-14 17:30:32 -08:00