39d9760e5d
added ismedia url filter to cover all the jpg,gif,mpeg,css rules.
Matt Wells
2013-09-18 09:40:59 -07:00
c77453348f
Merge branch 'master' into diffbot
Matt Wells
2013-09-18 09:23:48 -07:00
d6815f2c9d
if family filter enabled (&ff=1) then prepend "gbadult:0 |" to the query to restrict to non-adult pages.
mwells
2013-09-18 00:11:55 -06:00
a0032e0eb7
added another log statement for debugging the adult content detector. we err on the side of caution for the most part.
mwells
2013-09-18 00:06:21 -06:00
fc692202ba
fix integration of url filters into crawlbot page
Matt Wells
2013-09-16 16:27:48 -07:00
e7ed9254d4
formatting...
Matt Wells
2013-09-16 15:33:45 -07:00
1a780d1f4a
pretty up a little
Matt Wells
2013-09-16 15:18:55 -07:00
a034604cef
clean up to remove g_conf.m_useDiffbot
Matt Wells
2013-09-16 15:00:43 -07:00
cb9969ad22
fix token bug
Matt Wells
2013-09-16 14:38:29 -07:00
3dfba4de69
doc updates
Matt Wells
2013-09-16 14:29:01 -07:00
4c11265a98
more updates to crawlbot api
Matt Wells
2013-09-16 13:59:11 -07:00
676437c3c4
more universal api updates.
Matt Wells
2013-09-16 11:42:04 -07:00
04f7774543
lower spider crawl info stats threshold
Matt Wells
2013-09-16 11:27:09 -07:00
df96f81e78
fix spidering and other things.
Matt Wells
2013-09-16 11:22:07 -07:00
f974d6a47b
fixes for crawlbot universal api.
Matt Wells
2013-09-16 10:49:37 -07:00
a50898649b
various fixes.
Matt Wells
2013-09-16 10:16:49 -07:00
9db501d91c
resolve merge conflict for nullTerm()
Matt Wells
2013-09-16 09:06:33 -07:00
78a334198b
Merge branch 'master' into diffbot
Matt Wells
2013-09-16 09:05:37 -07:00
3ac79de92e
fix typo adurl -> addurl.
Matt Wells
2013-09-16 08:11:06 -07:00
e6f87f5049
do not send email alerts to sysadmin@gigablast.
Matt Wells
2013-09-16 08:10:18 -07:00
5deda56ede
minor documentation updates.
Matt Wells
2013-09-15 22:16:14 -07:00
3fdbae4b05
admin.html documentation update.
Matt Wells
2013-09-15 22:05:01 -07:00
68db2e6cc6
fix bug when checking the delete checkbox on the injection page.
Matt Wells
2013-09-15 21:47:42 -07:00
965e23f192
fix core from hashtablex::set() not getting enough buf space. now we force it to allocate a minimum of 32 slots to fix another bug where it was re-allocating immediately upon adding a key because growTable() is ALWAYS called if there are fewer than 20 slots!
Matt Wells
2013-09-15 21:15:58 -07:00
991e2f30f7
speed up whitelist hashtable like 20x using hashtable key magic.
Matt Wells
2013-09-15 21:10:53 -07:00
928dc36a03
get "&site=abc.com+xyz.com"... working to restrict search results to specified sites. tested a little.
Matt Wells
2013-09-15 20:16:48 -07:00
2211881e59
take apt-get install ssl stuff out of admin.html installation instructions since we supply the ssl headers now.
mwells
2013-09-15 18:27:47 -06:00
01c2a6d381
we already include our own 32-bit libssl.a and libcrypto.a so we can ensure stability. so we have to include the header files as well really.
mwells
2013-09-15 18:25:49 -06:00
107037c6a2
new &sites=xyz.com+abc.com+... functionality compiles ok.
mwells
2013-09-15 18:14:32 -06:00
b684414e16
almost done adding support for whitelists. i.e. list of sites to restrict search results to, for instance.
mwells
2013-09-15 15:15:56 -06:00
7ecffec40f
universal api updates
Matt Wells
2013-09-13 18:10:03 -07:00
d982997b0c
streamline crawl stats.
Matt Wells
2013-09-13 17:34:39 -07:00
93ce424d99
start working on the main gui for crawlbot which is /crawlbot
Matt Wells
2013-09-13 16:22:07 -07:00
6b330da240
cleanup warnings in log.
Matt Wells
2013-09-13 14:37:35 -07:00
eb65b9265d
call diffbot /api/analyze if classify is true or api = "all" now. it will return "type": in the json to indicate page type. basically, it classifies the page.
Matt Wells
2013-09-13 14:13:56 -07:00
19056fc3f2
show "processed" instead of "matched". other fixes for spider stats. add new crawl stats. attempts and successes.
Matt Wells
2013-09-13 11:51:55 -07:00
e3e6551e23
fix diffbot bugs.
Matt Wells
2013-09-13 11:34:40 -07:00
7dd647c222
trying to fix nukeJSON code.
Matt Wells
2013-09-13 10:38:34 -07:00
ef3990da98
now when re-indexing an existing xml doc we first call nukeJSONObjects() to delete any pages that it indexed from the json objects it had in its diffbot reply. in this way it can then re-add them if its new diffbot reply has them again.
Matt Wells
2013-09-13 10:00:21 -07:00
a412c798bf
Merge branch 'master' into diffbot
Matt Wells
2013-09-13 09:24:28 -07:00
5dc7bd2ab4
integrate diffbot from svn back into git.
Matt Wells
2013-09-13 09:23:18 -07:00
e152205765
make depend update
mwells
2013-09-09 02:37:47 -06:00
1d63aa936c
remove plotter.h includes causing compiler errors on some machines.
Matt Wells
2013-09-09 01:25:00 -07:00
76b390aea2
fix typo
Matt Wells
2013-09-08 19:51:57 -07:00
d930a833cc
try to fix compiler error related to bad delete function override. added "throw()" before the first "{" in the function body.
mwells
2013-09-08 20:15:39 -06:00
828345a4c7
fix compiler warning in types.h.
mwells
2013-09-08 20:00:52 -06:00
e1968b2237
Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-09-08 18:43:00 -07:00
cecc655eac
minor ifdef fix.
Matt Wells
2013-09-08 18:42:44 -07:00
657d669ec8
exclude events and seo functionality. most people want this for web search so it should be a non-issue.
mwells
2013-09-08 17:07:42 -06:00
34b6d3e74a
fixed some cores. brought in fixes from old repo.
mwells
2013-09-08 16:16:13 -06:00
dcf45dd69d
dump out doledb to disk when it has more than 50,000 negative keys to avoid positive/negative key annihilation delays.
mwells
2013-09-08 15:09:54 -06:00
03706131fe
documentation updates in Spider.h.
Matt Wells
2013-09-08 13:42:02 -07:00
54c9353dbd
try to fix core from g_inSigHandler being set. it should never be set since we do not use real time signals any more
mwells
2013-09-08 12:34:37 -06:00
0581f86265
fix core from calling a gettime related function from a pthread when a signal handler from the main thread was in use and POSSIBLY in the same function when the signal went off. different threads should be able to access that function just fine i'd imagine.
mwells
2013-09-06 15:39:53 -06:00
7aa81abf91
use the "onsite" keyword in your url filters instead of this "only spider links from same host" switch to keep things simpler.
mwells
2013-09-06 09:37:17 -06:00
c58df10155
fix major bug causing spiders not to work.
Matt Wells
2013-09-04 11:01:24 -07:00
91c4e768b1
more family filter fixes
mwells
2013-09-01 18:28:49 -06:00
aaf333c46c
try to get family filter (&ff=1) working again to filter out adult search results.
mwells
2013-09-01 18:22:38 -06:00
afbd1e2b96
fix core from trying to get the time while in a sig handler. getTime() is not async safe.
mwells
2013-09-01 12:55:22 -06:00
93dfb0cfd4
fix for the "spiders stuck" fix.
mwells
2013-08-31 11:25:26 -06:00
5e0a53b909
minor print change
mwells
2013-08-31 10:57:36 -06:00
af46945403
show more info when dumping doledb.
mwells
2013-08-31 10:55:05 -06:00
9696c7936a
Merge branch 'master' into diffbot
Matt Wells
2013-08-30 16:33:00 -07:00
94e6492916
removed MAX_COLL_RECS so we can have unlimited collections. the only limit now is sizeof(collnum_t), which is 16 bits, 15 bits usable. can always expand this so we can have more than 32k collections.
Matt Wells
2013-08-30 16:20:38 -07:00
f6bcaeb76a
minor fix.
mwells
2013-08-30 00:16:30 -06:00
900bbf8fba
try to fix the bug of the spiders kinda getting stuck and not spidering to their max potential because of doledb record annihilations at the top of the spider priority queue in spiderdb of SpiderRequests. was causing lots of re-reads in Msg5.cpp of doledb, like over 300 rounds, very slow.
mwells
2013-08-29 21:59:02 -06:00
2e9c8f7c6e
Merge branch 'master' of github.com:gigablast/open-source-search-engine
mwells
2013-08-29 21:17:46 -06:00
84fae9a3c6
Fix issue of reading spiderrequests from doledb at the very first key in spiderdb. causes lots of positive/negative key annihilations. we end up re-reading like 300 times in some cases just to get a url from a doledb priority.
mwells
2013-08-29 21:16:59 -06:00
ca2a024d04
fixed up thread/spider log msgs. fixed core from calling fprintf in alarm signal missed quickpoll handler.
mwells
2013-08-29 21:15:42 -06:00
e925012dce
change a couple of possible reserved names in C++ to non-reserved names. #define _ADDRESS_H_ to _GB_ADDRESS_H_ etc.
mwells
2013-08-28 22:59:01 -06:00
82ee2dfed7
fix cores when spider is unzipping gzipped web pages.
mwells
2013-08-28 22:49:22 -06:00
80179525c1
when using pthreads block SIGIO so it does not silently kill the gb process. we no longer have a handler for it because it was bogging down the cpu: it went off every time a udp datagram was sent/received and it seemed to have a ton of overhead with it. SIGIO used to be sent when the signal queue was full so we'd resort to polling the file descriptors, so i'm not sure how this will affect us. also updated Threads.cpp to use getpidtid() instead of getpid() to get the thread id, not the process id, when using pthreads. using pthreads is now default behaviour even though they suck. we used to use clone() but the newer stuff doesn't allow us to override errno_location anymore.
mwells
2013-08-21 15:01:26 -06:00
6332de2daf
added link to compare.html comparison to SOLR into documentation.
mwells
2013-08-21 13:14:17 -06:00
37a6549a58
updates to developer.html developer documentation. removed a lot of obsolete information. still needs more work.
mwells
2013-08-21 13:09:55 -06:00
8971d9b932
comment out urldb from developer.html since no longer used.
mwells
2013-08-21 08:59:51 -06:00
6cf0497c2c
added a little posdb documentation to developer.html. posdb replaced indexdb as the new index because it has word position info as well as word field info.
mwells
2013-08-21 08:40:28 -06:00
a2a57addd9
try fixing the cpu being slammed in the sigiohandler. seems like the signal's meaning might have changed in the kernel, etc. over the years. fixed Loop.cpp.
mwells
2013-08-20 14:12:44 -06:00
a270a9bc91
updated README.md to reference compare.html
mwells
2013-08-19 17:20:30 -06:00
7d3cc672c8
use ./gb blaster -u <fileofurls> to just inject urls, but use -i to also add the outlinks to spiderdb.
mwells
2013-08-19 16:33:27 -06:00