Matt
1e248218da
added 4 more diffbot errors so hopefully
no more 'unknown diffbot error' error codes
in crawlbot.
2016-01-11 16:12:33 -08:00
Matt
032f597a16
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Conflicts:
Errno.cpp
Errno.h
2016-01-11 15:30:53 -08:00
Matt
33a40480b4
try to fix Unknown Diffbot Error error.
2016-01-11 15:28:30 -08:00
Matt
80991c943f
complete merge of ia code into testing.
make indexing warcs/arcs a switch in spider controls.
2015-11-09 12:46:06 -07:00
Matt
20da5753f4
added switch to just return error if not all shards
returned results because a shard was dead!
the 'all or nothing' switch.
2015-09-11 13:43:49 -06:00
Matt
c1ec4dedbb
fix for bad query formation.
text:""foo bar""
2015-08-02 11:34:55 -06:00
Matt
1327301a8d
Merge branch 'testing' into ia
Conflicts:
Errno.cpp
Errno.h
2015-07-01 19:03:56 -06:00
Matt
902a8fc61d
fix errno mismatch bug
2015-06-18 10:33:27 -06:00
Zak Betz
e399a8b0aa
Add qa test for arc and warc files. Change XmlDoc to use timeaxis url
when creating the titlerec key instead of the firsturl.
2015-05-21 15:19:33 -06:00
Matt Wells
f90ebfd1d6
fix issue of not adding a spider status doc
when the dns lookup failed on a fakefirstip spider request.
also increment crawl attempts counter.
2015-05-06 10:47:27 -07:00
Matt Wells
050206d5dc
fix core from resetting the url filters
when a url was about to be spidered.
2015-05-05 09:39:45 -07:00
Matt Wells
e346a14a47
added logic to retry diffbot reply on connection reset,
connection timed out or gateway timed out (http status 504)
msgs. added logic to detect truncated json (missing final })
and not print it. also, at index time, we set a diffbot missing
curly error to g_errno so the whole url can be retried later.
2015-03-09 20:54:34 -07:00
Matt Wells
38caa517f2
add switches to disable injections or querying
from the master controls, for all collections.
2015-03-04 10:49:37 -08:00
Matt Wells
ee672bb3a3
say 'Injected url already indexed' and not 'Injection abandoned'
for clarity's sake.
2015-02-26 16:04:51 -07:00
Matt Wells
c8c56a24da
fixed query reindex for diffbot json docs.
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
2014-12-15 16:49:20 -08:00
Matt
4e8a42e024
text replacements for bad int32_t substitutions
2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6
good checkpoint. quite a few fixes.
2014-11-17 18:13:36 -08:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
mwells
754d5b4755
rename admin.html to faq.html etc. file juggling.
2014-08-31 09:51:21 -07:00
Matt Wells
b393a1bbbe
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
2014-07-10 10:06:55 -07:00
mwells
5bbdb8e172
got page add url and add url api working.
2014-07-09 20:32:30 -07:00
mwells
d9ae010371
shard gbfacetstr:gbxpathsitehash123456 terms by termid for speed.
got them working again multicasting a msg 0x39 to the appropriate shard.
set special msg39request flag for better performance for those guys.
2014-07-07 12:32:27 -07:00
mwells
6434e5cc04
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
Parms.h
2014-07-07 09:49:59 -07:00
mwells
43d0d636ee
fix dmoz building.
2014-07-05 22:20:15 -07:00
mwells
29d170631a
more api updates
2014-07-05 12:36:01 -07:00
Matt Wells
886063a3bd
fixes for query reindex.
2014-07-03 12:24:14 -07:00
Matt Wells
1361e5728c
show actual diffbot error in urls.csv.
do not stop indexing page and harvesting links on diffbot error.
2014-07-02 11:53:24 -07:00
mwells
92799ef393
add support for tunnelling https fetch
through an http proxy using CONNECT
directive. needs more debugging.
2014-07-01 10:43:52 -06:00
Matt Wells
27ffd23345
handle boolean query overflow errors better.
2014-06-10 17:21:55 -07:00
Matt Wells
72c6d032d8
fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
82726879a2
support base64 generated thumbnails in serps.
2014-04-24 14:04:57 -07:00
mwells
be99155986
more updates
2014-04-09 11:03:31 -07:00
Daniel Steinberg
7b5816f194
updated error message
2014-03-12 20:56:27 -07:00
Daniel Steinberg
c81bbf6934
more informative error message
2014-03-11 18:10:21 -07:00
Daniel Steinberg
14c1b2efa3
more informative error message
2014-03-11 18:06:42 -07:00
Daniel Steinberg
2331b4673d
Defect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around)
2014-03-11 16:21:58 -07:00
Matt Wells
8876dae984
added and fixed support for <link ahref=xxx rel=canonical>.
treat those as simplified meta redirects.
updated spider dedup documentation in developer.html file.
2014-01-30 10:37:59 -08:00
Matt Wells
6a45e42128
added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages, although it is
currently not enabled; we are still using the more strict one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
f64b53bfb3
almost done with rebalancing code
2014-01-10 14:12:58 -08:00
Matt Wells
b0d77a834a
do not spider fake ips requests, just re-add them
with the right firstip
2013-12-20 12:22:02 -08:00
Matt Wells
6495dfd86e
try to fix json parser overflow error. needs
testing. tried to fix round num from incrementing
for little job because i think server overload.
should be fixed right some time. just made wait
time 30 secs instead of 10 in Spider.cpp.
2013-11-15 11:30:16 -08:00
mwells
9bf8bf7712
add spider reply even on g_errno now with an error
code of EINTERNAL error in the spider reply.
no longer just sit on the lock. this was blocking
an entire ip when just lock sitting for 3 hrs.
and only do read rate timeouts if there was at least
one byte read. this was causing diffbot reply to
read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
Matt Wells
df96f81e78
fix spidering and other things.
2013-09-16 11:22:07 -07:00
Matt Wells
a50898649b
various fixes.
2013-09-16 10:16:49 -07:00
Matt Wells
5dc7bd2ab4
integrate diffbot from svn back into git.
2013-09-13 09:23:18 -07:00
Matt Wells
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00